Database terminology (and Python)

What is an ORM? ORM stands for object-relational mapper, or object-relational mapping tool. It’s a translational layer of code between your script and the database, and it converts the database data into objects your code can use easily. Think of the database as a collection of items that are encoded into a complex structure, like a maze with secret doors and hidden passages. An ORM translates the structure of the maze into a parallel structure that is usable by the language of your script or program — in our case, Python.

SQLAlchemy is an ORM for Python, but it’s not the only one.

Peewee is another ORM for Python.

When using an ORM, you need to know what kind of DBMS (or RDBMS)  you are using. DBMS stands for database management system, and that means the software that runs your database. The database itself is the collection of data. The common SQL DBMSes are SQLite, MySQL and PostgreSQL. Although SQLite is a DBMS, it’s quite different from others in that it is self-contained and does not use a server.

These various DBMSes are not interchangeable, and therefore, you need to know how to make your chosen ORM communicate with the DBMS you are using.

Python ORMs are discussed in detail — but quite clearly — here. That page includes links to many resources.


Working with SQL databases in Flask

Activate your virtualenv in the folder where you’ve been doing Flask projects.

Install the Flask-SQLAlchemy extension like this (use uppercase as shown):

$ pip3 install Flask-SQLAlchemy

That installs SQLAlchemy as well as Flask-SQLAlchemy. No need for a separate install.

Using a database with Flask

You’ll have to connect your app to an existing SQL database, whether the app reads from the database, writes to the database, or both. Connecting will require your own database username and database password. (You are the owner of the database.)

You can create the SQL database using Python, but that is not required. You might already have a database. Then all you need to worry about is how to connect it.

If you do use Python to create a SQL database (and that’s an “if,” not a necessity), you will only do it once. You don’t create the same database again and again.

Your database may have one table, or more than one table. That depends on what you need, or the structure of the existing SQL database.

Your app might only read from your SQL database. You can write SQL queries to accomplish this — use Flask-SQLAlchemy commands to do so.

Your app might allow people to write to your SQL database. In that case, it is possible you will want people to log in securely.

You might write a Python script to populate your database from the contents of a CSV file. This would be fairly simple if you only need to run it once. If you need to add records repeatedly (say, once per month) to an existing database, you might need to check whether you are duplicating records that are already there. If you need to check for existing records and update them, that’s more challenging.

If people are writing into your database, you will want to give them a form, or forms, for doing so. Remember that Bootstrap and Flask-WTF are your very best friends for that!

Make a checklist for your database project

Figure out what your own app will need to do:

  1. Create a database (optional; only once)
  2. Connect to the database (required)
  3. Read from the database: Query
    • Display all records?
    • Display a subset of records?
    • Search for multiple records?
    • Search for one record at a time?
  4. Write to the database
  5. Allow users to write to the database
    • Insert a new record?
    • Update an existing record?
    • Delete a record?

For all Python and SQL commands, refer to the links listed under “User’s Guide” in the Flask-SQLAlchemy documentation.

Type hints in Python

In the Barry book (Head First Python, 2nd edition), he always includes something known as “type hints” in his code. This is akin to making a note that a particular variable is intended to be a string, an integer, a Boolean value, etc.

Every time Barry specifies a variable type like so:

def greeting(name: str) -> str:
  return 'Hello, ' + name

You can OMIT the boldfaced parts. Those are the markup for type hints, and hardly anyone uses them in Python.

This works perfectly well (and is so much nicer):

def greeting(name):
  return 'Hello, ' + name

The version with no type hints is what you have learned up until now, and there’s nothing wrong with using it.

Type hints are completely unnecessary:

“This PEP introduces a provisional module to provide these standard definitions and tools, along with some conventions for situations where annotations are not available.”

“While these annotations are available at runtime through the usual __annotations__ attribute, no type checking happens at runtime.”

See also this Reddit post for more about how type hints are not needed and also not standard practice in Python.

What journalist developers think about

This is a thoughtful and fascinating article by Tyler Fisher, journalist and web developer at Subverting the Story Model. Here’s an excerpt:

Screen Shot 2017-03-31 at 4.03.49 PM

I may have told you what a huge fan I am of NPR’s live annotation system for speeches and debates — it’s also discussed in the article. It runs off a shared Google doc. So sweet.

What I like most about Fisher’s article is that he’s thinking about how to serve the public better, and that’s something all journalists should do more often.

Summary: Using virtualenv

A few students experienced some confusion over using a virtualenv this week. Let’s review:

“virtualenv is a tool to create isolated Python environments. virtualenv creates a folder which contains all the necessary executables to use the packages that a Python project would need.” (source)

You can think of it as a quarantine space: It is completely sealed off from the rest of your computer. When a virtualenv is activated, and you install things in it, those things are NOT AVAILABLE outside. When you are not IN that virtualenv, you cannot access those things (such as BeautifulSoup).

When we created our two separate, different virtualenvs in class (on different days), we installed different things in the two. So you have ONE virtualenv that is for scraping, and it contains the BeautifulSoup library. You have ANOTHER virtualenv — in another folder — that contains the Flask framework and its many dependencies.

Here is how we set up our two virtualenvs in class:

  1. We created a new folder for that set of projects.
  2. We cd’d into the new folder, using Terminal.
  3. While inside the folder, we entered this long command on one line:

virtualenv --python=/Users/username/Software/Python-3.6.0/mybuild/bin/python3 env

(In place of “username,” you typed your own username for your own laptop.)

That command created a new virtualenv running Python 3 (not 2). The long path in the long command refers to the exact Mac OS location where we installed Python 3. If you’re on Windows, or if you never did a safe Python 3 install on a Mac, you don’t have that. There are instructions for Windows at the end of this document (pages 5–6).

Note that by activating a virtualenv that was set up in this way, you are automatically running Python 3 (not 2) for all the things.

If you had wanted to create a Python 2 (not 3) virtualenv on a Mac, you could instead just type this:

virtualenv env

Note that in both cases, the command creates a new folder named env inside the folder you are currently in. If you wanted it to be named foobar, you would type: virtualenv foobar

For that reason, you must go inside the project folder before you activate the virtualenv with this command (it is different for Windows):

source env/bin/activate

Don’t go inside the env folder.

(If your folder was foobar instead of env, the command would be source foobar/bin/activate)

While the virtualenv is activated, you’ll see (env) at the left side of your Terminal prompt. While you see that, you have access to everything you installed in that virtualenv.

To install new Python packages, libraries, frameworks, etc., for the project, make sure the virtualenv is activated. This is what makes the “quarantine space” work: You are installing all the new things within the enclosed, sequestered space, and they will not affect anything on the rest of your system. That is the beauty of the virtualenv.

When you’re finished working, you can close down the virtualenv with one command:


My practice for handing projects, or a set of projects, is to create a new folder and name it something broad, like scraping (for all my web scraping projects). Then I go inside scraping, create a new virtualenv there, named env, and install what I need to do scraping projects — such as BeautifulSoup. For each scraping project, I create a new folder inside scraping and name the new one something that matches the project — such as soccer-players, for example.

When I wanted to start making some Flask projects, I created a new folder — outside scraping, because this is no longer about scraping work — and created a new virtualenv there, named env. I activated the virtualenv and then installed what I need to do Flask projects.

Other methods to create a virtualenv

Other people use their own systems to make virtualenvs. Some people always name theirs venv instead of env. Some people name their virtualenv after their project (for example, soccer-players), which confused me very, very horribly when I was starting to learn about virtualenvs.

I mention this because you will no doubt see tutorials where people use these other systems or methods. They all work — but YOU have to know what YOU’RE doing. YOU have to keep track of what you named your folder, your virtualenv, your project, etc.

In our class, I stick to my one way of creating a new virtualenv because it seems clear and simple to me. I didn’t get the hang of it right away, but now it seems quite natural to me. It’s consistent, and it keeps all my projects well organized.


Screen Shot 2017-03-22 at 10.19.59 AM

Above, I used a series of commands in Terminal to demonstrate that I am running Python 3.6.0 while I am in my virtualenv inside a directory named headfirst. When the virtualenv is deactivated, I am running Python 2.7.10.

Recap of scraping resources

First and foremost, hold on to the Mitchell book. It is a great resource for any scraping project you want to do in the future. Use the index before you go to Google.

We read chapters 1–6 before spring break. The rest of the chapters are more specialized. It doesn’t mean they’re harder — they just refer to things that go beyond basic everyday scraping (like how to scrape a site that requires you to log in; see page 142).

My Google slide decks based on Mitchell:

Other resources I provided for scraping are linked on the Course Schedule page under weeks 6, 7 and 8.

The one additional resource from me (not linked under those three weeks) is My first homemade Web scraper, which, as you know, is linked to an extensive GitHub repo. The CSV code is covered on pages 9 and 10. Remember that the code shown on page 10 is partial, although it includes all of the CSV parts.

API examples

My Wikipedia viewer code is here (an example of an API project). The live version is here.

Here is a different personal API project I did: Random Quote Machine.

My best tip for finding APIs to use is just search with Google and include “api” in your search. That’s how I found the APIs for The New York Times and for Google.

Update from Reclaim Hosting

Another great service from my favorite web hosting provider: We can all have secure websites for no extra charge!

Instructions here. What you’ll see if you follow the instructions precisely (take your time) is that all your site’s pages will default to https:// instead of http://

Other web hosting services (such as Dreamhost) charge you extra for this!

HTTP vs. HTTPS: What’s the Difference and Why Should You Care?

tl;dr —  By switching to HTTPS, you are making your website more secure.

Scraping the CIA World Factbook

It seems the code on the country pages for the CIA World Factbook is very poorly formatted. Just take any URL for a country page and run it through the validator — yikes!

Changes in the past year have made the HTML so error-ridden that BeautifulSoup can’t parse the page properly if we use the HTML parser included in Python’s standard library. However, the beauty of Python is that there are lots of libraries for everything, and we can install and use a more “lenient” parser.

This requires two simple steps.

First, go into your scraping directory in Terminal and activate the virtualenv.

Step 1: Install the html5lib parser this way:

$ pip3 install html5lib

Step 2: In your code, change the line that says:

bsObj = BeautifulSoup(html, "html.parser")

To this:

bsObj = BeautifulSoup(html, "html5lib")

That’s everything! Your scrape will now work!

I figured this out by reading this part of the BeautifulSoup documentation.

I updated the gist linked to your homework. It now has a current scrape file at the bottom, named country2.txt.


Scraping a random link

What is Mitchell’s program on p. 34 doing? (It is in her Chapter 3 files.)

  • Seeds a random number (using datetime)
  • Gets all the local links from the Kevin Bacon Wikipedia page
  • Stores them in the variable named links
  • Then this loop:
    while len(links) > 0:
       newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
       links = getLinks(newArticle)
  • That while-loop randomly selects one href attribute in the array named links
  • It prints only the new link
  • It gets all the local links from the NEW Wikipedia page and ADDS THEM to the array links
  • It loops back and does this again until there are no links left — which might take a VERY LONG TIME

So if the links were written to a file, one per line, the file would keep getting longer and longer.

Note that even with a seed, these sequences are pseudo random, not the same as truly random, because if you start with the same seed, the exact sequence will be repeated.