Scraping the CIA World Factbook

The HTML on the country pages for the CIA World Factbook is very poorly formatted. Take the URL for any country page and run it through an HTML validator, and you'll get a long list of errors. Yikes!

Changes in the past year have made the HTML so error-ridden that BeautifulSoup can’t parse the page properly if we use the HTML parser included in Python’s standard library. However, the beauty of Python is that there are lots of libraries for everything, and we can install and use a more “lenient” parser.

This requires two simple steps.

Before you start, go into your scraping directory in Terminal and activate the virtualenv, so that the parser gets installed where your scraper can find it.

Step 1: Install the html5lib parser this way:

$ pip3 install html5lib

Step 2: In your code, change the line that says:

bsObj = BeautifulSoup(html, "html.parser")

To this:

bsObj = BeautifulSoup(html, "html5lib")

That’s everything! Your scrape should now work.
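If the scrape instead stops with an error about a missing parser, the most likely cause is that html5lib was installed outside the virtualenv. Here is a rough sketch of a check you could run first. The helper names parser_available and make_soup are made up for this example and are not part of BeautifulSoup; the only assumption is that bs4 is installed in the same virtualenv.

```python
from importlib.util import find_spec


def parser_available(module_name="html5lib"):
    """Return True if the named parser module can be imported here."""
    # find_spec looks the module up without actually importing it.
    return find_spec(module_name) is not None


def make_soup(html):
    """Parse html with html5lib, or explain what to install (hypothetical helper)."""
    if not parser_available("html5lib"):
        raise RuntimeError(
            "html5lib is not installed in this environment; "
            "activate the virtualenv and run: pip3 install html5lib"
        )
    # Imported here so the check above still runs if bs4 itself is missing.
    from bs4 import BeautifulSoup
    return BeautifulSoup(html, "html5lib")
```

In your scraper you would then write bsObj = make_soup(html) in place of the BeautifulSoup(...) line. This is only a convenience; the one-line change in Step 2 is all the fix actually needs.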

I figured this out by reading this part of the BeautifulSoup documentation.

I updated the gist linked to your homework. It now has a current scrape file at the bottom, named country2.txt.

 
