Scraping the CIA World Factbook

It seems the code on the country pages for the CIA World Factbook is very poorly formatted. Just take any URL for a country page and run it through the validator — yikes!

Changes in the past year have made the HTML so error-ridden that BeautifulSoup can’t parse the page properly if we use the HTML parser included in Python’s standard library. However, the beauty of Python is that there are lots of libraries for everything, and we can install and use a more “lenient” parser.

This requires two simple steps.

Before you start, go into your scraping directory in Terminal and activate the virtualenv.

Step 1: Install the html5lib parser this way:

$ pip3 install html5lib

Step 2: In your code, change the line that says:

bsObj = BeautifulSoup(html, "html.parser")

To this:

bsObj = BeautifulSoup(html, "html5lib")

That’s everything! Your scrape will now work!
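If you are not sure whether html5lib made it into your virtualenv, a small helper can pick the parser for you. This is only a sketch: choose_parser and make_soup are names made up for this example and are not part of BeautifulSoup.

```python
import importlib.util


def choose_parser():
    """Return "html5lib" if that package is installed, else "html.parser"."""
    if importlib.util.find_spec("html5lib") is not None:
        return "html5lib"
    # Fall back to the parser that ships with Python's standard library.
    return "html.parser"


def make_soup(html):
    """Parse html with the most lenient parser that is available."""
    # Imported inside the function so choose_parser() can be used on its own.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, choose_parser())
```

In your scrape file you could then write bsObj = make_soup(html) instead of naming the parser yourself.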

I figured this out by reading this part of the BeautifulSoup documentation.

I updated the gist linked to your homework. It now has a current scrape file at the bottom, named country2.txt.


Scraping a random link

What is Mitchell’s program on p. 34 doing? (It is in her Chapter 3 files.)

  • Seeds the random number generator (using datetime)
  • Gets all the local links from the Kevin Bacon Wikipedia page
  • Stores them in the variable named links
  • Then this loop:
    while len(links) > 0:
       newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
       print(newArticle)
       links = getLinks(newArticle)
  • That while-loop randomly selects one href attribute in the array named links
  • It prints only the new link
  • It gets all the local links from the NEW Wikipedia page and stores them in links, REPLACING the old list
  • It loops back and does this again until it lands on a page with no local links, which might take a VERY LONG TIME

So if the links were written to a file, one per line, the file would keep getting longer and longer.
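To watch the loop work without hitting Wikipedia, here is a small offline sketch. The three-page link graph is made up, and get_links is a stand-in for the book's getLinks that returns plain strings instead of BeautifulSoup tags. Notice that the last line of the loop assigns a fresh list to links each time.

```python
import random

# Made-up link graph standing in for Wikipedia: page -> local links on it.
FAKE_PAGES = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/C"],
    "/wiki/C": [],  # a dead end: no local links, so the loop stops here
}


def get_links(article):
    """Stand-in for getLinks(): return the links found on one page."""
    return FAKE_PAGES[article]


def random_walk(start, rng):
    """Follow one random link per page until a page has no links."""
    visited = []
    links = get_links(start)
    while len(links) > 0:
        new_article = links[rng.randint(0, len(links) - 1)]
        visited.append(new_article)     # the book's version prints this
        links = get_links(new_article)  # a new list replaces the old one
    return visited


path = random_walk("/wiki/A", random.Random(1))
print(path)
```

With this toy graph the walk always ends at /wiki/C. On the real Wikipedia almost every page has links, which is why the real program can run for so long.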

Note that even with a seed, these sequences are pseudorandom, which is not the same as truly random: if you start with the same seed, the exact same sequence will be repeated.
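You can check the repeated-sequence behavior yourself. In this minimal sketch, five_rolls is a name made up for the example and 42 is an arbitrary seed.

```python
import random


def five_rolls(seed):
    """Return five pseudorandom digits produced from the given seed."""
    rng = random.Random(seed)  # a separate generator, seeded explicitly
    return [rng.randint(0, 9) for _ in range(5)]


first = five_rolls(42)
second = five_rolls(42)
print(first == second)  # True: same seed, same sequence
```

Seeding from the current time, as the book's program does, gives a different seed on each run, so the sequence changes from run to run.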

One last regex example

I wanted to find a real-life list I could use to reinforce the last thing I told you about regex. Here is a screen capture from the Pythex regex editor:


The full data set would be a list of all the basketball players in the NBA that I scraped from somewhere. In the “test string,” I only pasted in nine lines to serve as my test data.

Also, I clicked MULTILINE — very important when you want the regex string to bring back every line that matches your criteria.

Like in class, I want to get only the point guards (indicated by PG). I want to get the complete line for each point guard, so I must make sure the green highlights the entire line.

My regex string: ^(.)*(, PG)(.)*$

Starting with ^ and ending with $ ensures that I’ll get the complete line.

(.)* means any characters, and any number of characters, except a newline. It is in my string twice — at the beginning, and at the end.

(, PG) means I want those exact four characters, together, in order, to be in the line. Yes, a space is a character. If any line has more than one space between the comma and PG, I won’t get that line.

The green highlighting tells me my regex is good: It has all the point guards and no one else.
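Here is roughly how the same regex could be run in Python rather than in Pythex. The five roster lines are made up for this sketch (they are not the scraped data), and one of them has two spaces before PG on purpose.

```python
import re

# Made-up sample lines standing in for the scraped roster.
roster = """Player One, PG, Team A
Player Two, SG, Team B
Player Three, PG, Team C
Player Four,  PG, Team D
Player Five, C, Team E"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
pattern = re.compile(r"^(.)*(, PG)(.)*$", re.MULTILINE)

# group(0) is the whole match, which here is the complete line.
point_guards = [match.group(0) for match in pattern.finditer(roster)]

for line in point_guards:
    print(line)
```

This prints the lines for Player One and Player Three only. Player Four is skipped because of the extra space between the comma and PG.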

Links to Python regex resources are on the Course Schedule under Week 7.

Some assignment changes, weeks 7 and 8

We didn’t cover all I had hoped to in class on Feb. 21. As a result, I have moved the Mitchell homework (Assignment 6) to one week later. That means the only thing due on Friday, Feb. 24, is the Python homework (Assignment 7).

Assignment 6 will be available on Feb. 25, but I think you might want to wait until after class Feb. 28 to start it. It is due Friday, March 3. I am open to giving you an extension because I will not be grading this until the following week. I will leave the assignment open and “submittable” until March 7.

The only other thing due on March 3 is your scraping proposal. That assignment is open now, so you can get a look at it and start thinking, and you can ask questions about it in class next Tuesday.

The actual scraping project is due Friday, March 24.

Your final Python homework will be due after Spring Break, on March 17.

Last but not least, here is the “prompt” example I showed in class.



Working in Python 2 and 3

Here’s what to keep straight in your mind as you do Python assignments:

Zed (LPTHW) is all Python 2. Those of you on Mac OS: In any directory on your computer, when you type python in Terminal, you will launch Python 2.7.x. No virtualenv for this.

Mitchell and all scraping exercises use Python 3. In Terminal, navigate into your scraping directory. There, activate your virtual environment:

source env/bin/activate

BeautifulSoup is already installed there. You can run any Python file there, and it will run under Python 3.6.x, not 2.7.x. To run a file, type python followed by the file’s name (filename.py below is a placeholder for your own file):

(env) myname scraping$ python filename.py

Remember, you don’t start Python to run programs that are in files.

When you get an error, first ask yourself: Am I in the correct Python environment? 2 or 3?

Second, make sure you are not in Python when you are trying to run a file. When you see >>> you are in Python. To run a file, you need to be at the bash prompt: $
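For the first question, a short script can tell you which Python is running it. This is a sketch, and describe_environment is a made-up name. Save it in a file and run it from the bash prompt. It is written to work under both Python 2 and Python 3.

```python
import sys


def describe_environment():
    """Return (major version, True if running inside a virtualenv)."""
    major = sys.version_info[0]
    # Older virtualenv sets sys.real_prefix; Python 3 environments make
    # sys.base_prefix differ from sys.prefix.
    in_venv = (hasattr(sys, "real_prefix")
               or getattr(sys, "base_prefix", sys.prefix) != sys.prefix)
    return major, in_venv


major, in_venv = describe_environment()
print("Python major version: %d" % major)
print("Inside a virtual environment: %s" % in_venv)
```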

To quit using the virtualenv:

(env) myname scraping$ deactivate

Tips for Assignment 2

I suggest you read the rubric in Canvas (“Assignment 2 – database app”) first of all. Like all rubrics, it lists what you are expected to do.

Then you’ll read through the Assignment 2 document. You already know you need to create a fresh new database. You also have to create several files to perform the tasks described in the Assignment 2 document.

  1. Think about your database. What should its name be? What should the table name be? What are the column names? What are their correct data types (INT, TEXT, etc.)? Remember that you are creating a public-facing database, not one that each person can use privately. Also, don’t reinvent the wheel (example: do not make IMDb).
  2. Think about the way the contents of the database should look when they are all displayed on a web page (example). You probably need to use a table. You might want to design and code the page with dummy text in the table first — and no PHP — so you will know what it’s going to look like.
  3. Think about the form that the user has to fill in. Remember that enter_new_record.php (see it live) in the sockmarket repo does everything you need — but in your form, the names of the input elements will not be the same as the ones in enter_new_record.php. (You are not selling socks!) Also remember that using a group of checkboxes would make a lot of extra work for you, so don’t do that. Radio buttons or a select menu will be okay, because there the user can only choose one thing. If you want to use all input text fields in your form, that’s fine.
  4. Think about the index.html page, where the user learns what the database is and what they can do with it. Open index.html (in the sockmarket folder) in your browser to see what I mean.

That’s the project: a database, an index.html page, a form the user can fill in (which writes to the database), and a page that reads the database and writes it out to a web page. By the way, that last one is what we saw in read_db.php, the first file explained in “MySQL and PHP: Next Steps.” Remember that you need to use the mysqli_ commands to prevent a SQL injection attack on your database. Refer to the document “Using the Sock Market files” as well as pages 5–9 in “MySQL and PHP: Next Steps.”

How to work

I suggest you use XAMPP and get everything working there, on your own laptop, and then at the end transfer it all to your hosted website. (Otherwise you will have to keep uploading files again and again, ugh.) You know you can import the database table(s) from XAMPP to your live site — we did that in class.

The other part of transferring will be like what you did for Assignment 1. Go back and read those steps when you’re ready to copy your database app to your website. Remember you will have to make some changes to the database.php file. Make sure your database on the live website has a username (NOT root) and password!

Everything you need should be in the Assignment 2 document or linked in it. You definitely should borrow heavily from the sockmarket repo, but RESIST the temptation to copy whole files. In particular, you MUST NOT copy the whole sockmarket repo, because it includes a lot of files you will not need for your database! I do not want to see those files in your GitHub repo!

As for GitHub: Yes, you MUST make a new repo. Use the GitHub Desktop app, click the “Add a Repository” button (upper left), and make sure to select ADD, not Create, so you can choose the folder in XAMPP htdocs that contains your project. You can do this at the end, or you can do it when you begin work and make multiple commits as you go along, which is how professional coders do it. You do NOT need a gh-pages branch. When you’re done, remember to commit and Publish or Sync so that I can see your final files.

The Khan Academy SQL review

Here’s a link to the SQL review slides, with UPDATE, DELETE and transactions: Modifying databases with SQL. You do not need to use any of those things in Assignment 2! The other two SQL reviews are linked here.

About the Python stuff listed under Week 5

Look at the Course Schedule page for Week 5. Notice there is NO QUIZ. Hooray! However, you need to get started on the Python things before class. Do the listed work, and we’ll continue and/or discuss in class on Feb. 7. You WILL have the assignment listed there, due Friday, and you WILL be finishing any leftover work from Assignment 2 before Friday.

Note also that NOW is the time to get the scraping book if you don’t have it yet.

Followup to class Jan. 24

Here is the document we were using in class on Jan. 24 (Week 3): MySQL and PHP: Next Steps. To review what we covered:

We downloaded and looked at a gist, read_db.php, which you added to your existing shoutbox project. I explained what all the PHP in the file does and showed you how to alter it so you can change the way the information from the database appears on the web page (in the HTML). All the changes I showed you were in lines 22–37, where a PHP while loop writes each row from the database query into the HTML. This one file does it all — except that it does REQUIRE the additional file database.php, which you already had from Week 2.

That is all going to be helpful for you in Assignment 2.

We then opened my sockmarket repo and talked about a form page that will write to a database. The HTML is in simple_form.php (see the repo for the file), and the PHP is in another file, simple.php.

As explained in MySQL and PHP: Next Steps, simple_form.php includes some constraints inside the HTML form tags to prevent people from typing more than you want them to type in the form fields. The form includes an HTML form select menu so that people can’t just type anything in one of the fields (for sock style, e.g. knee-high). Each form field’s tag includes the required attribute so that people must fill it, or else the form cannot be submitted. Making each field required is recommended!

In the file simple.php, we looked at the PHP code that takes the form data after it is submitted and writes it into a new row in the existing table named socks. (The database: sockmarket. The table: socks.) This is also explained in MySQL and PHP: Next Steps.

Exactly like your shoutbox, simple.php relies on database.php (a separate file) to connect to the database (in this case, the database is sockmarket and not shoutbox). Exactly like in your shoutbox, database.php stores the information about the database in a PHP variable named $conn — and thus a lot of the code is identical if you compare your shoutbox project to my sockmarket project.

Part of your Assignment 2 requirement is to provide a form that lets the user fill the form and add a new row to the database. Together, the sockmarket files simple_form.php and simple.php do that.

The way the file simple.php is set up, it does NOT use SQL prepared statements, so it is not protected against a SQL injection attack on your database. It is VERY IMPORTANT to use prepared statements correctly to protect your data! We are going to go over the prepared statements when we continue with the document MySQL and PHP: Next Steps next week, starting on page 5 with the file enter.php.
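The class files do this in PHP with the mysqli_ commands. The sketch below shows the same idea in Python’s built-in sqlite3 module instead, only because it runs without a database server. The columns (name, style) and the function add_sock are assumptions for this example, not the real sockmarket table.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE socks (name TEXT, style TEXT)")


def add_sock(conn, name, style):
    """Insert one row; the ? placeholders keep user text out of the SQL."""
    conn.execute("INSERT INTO socks (name, style) VALUES (?, ?)",
                 (name, style))
    conn.commit()


add_sock(conn, "Striped", "knee-high")
# Hostile input is stored as ordinary text instead of being run as SQL.
add_sock(conn, "x'); DROP TABLE socks; --", "ankle")

rows = conn.execute("SELECT name, style FROM socks ORDER BY rowid").fetchall()
print(len(rows))
```

The important part is that the SQL string never has the user’s text pasted into it. The values travel separately, which is also what a mysqli_ prepared statement does.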

Remember that the good resources to learn more about the mysqli_ commands are linked under Week 3 on the Course Schedule page, and you should use those resources instead of randomly Googling.

Also remember that lines 9–13 in simple.php depend on the names you used in the HTML tags for your form fields. The text that comes after $_POST[' must match the name of a particular field in your HTML form.

Here are my two slide decks to review the Khan Academy SQL lessons up to now:

Plan to attend NICAR 2017

What is NICAR? It’s the world’s biggest and best gathering of news nerds!

Who’s a news nerd? All the folks who write code and/or analyze data for journalism and storytelling.

NICAR is a four-day conference, and this year it will be in Jacksonville, Florida, March 2–5. The UF College of Journalism and Communications and the Knight Chair will sponsor up to 12 students to attend all four days. We have already reserved rooms at the conference hotel. There are tons of hands-on workshops led by professionals. All kinds of skills are covered: basic and advanced Excel, Python and Ruby programming, D3, data visualization, mapping, and more.

Students can join IRE (the parent organization of NICAR) for only $25/year, which gives you access to all kinds of resources and a very helpful Listserv where professional journalists post questions and answers daily about journalism code.

If you would like to apply to attend this amazing conference, please fill and submit this form. Current UF students only!