Web scraping: Reading documents

Because we spent more time than I had expected on chapters 3, 4 and 5 in Mitchell, we will not be going over chapter 6 in class. This post serves to highlight some valuable parts of that chapter.

  1. Scraping plain-text files: Can be done, but in some ways it is harder than scraping HTML files, because there are no tags to navigate. Don’t use BeautifulSoup with plain text: it is a parser for HTML (and XML), and a plain-text file gives it no markup to work with.
  2. Character encoding: Great information here that’s pertinent to foreign languages and the meta charset tag you all know and love. Also pertinent to the utf8mb4_unicode_ci collation setting we learned about in chapter 5. I strongly recommend you read pages 94–98, because it will make you smarter about how the Web works!
  3. CSVs: Essential reading if you ever need to scrape and read CSV files.
  4. PDFs: Yes, you can scrape data out of PDF files! There’s a Python library for that.
  5. Microsoft Word .docx files: The short version is, they suck. However, they can be scraped. Details in the chapter.
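
To make items 1, 2, 3 and 5 a bit more concrete, here is a minimal sketch using only Python's standard library. It is not Mitchell's code from the chapter. The function names (decode_text, read_csv_rows, docx_paragraphs) are my own, and it assumes you have already downloaded the file as raw bytes. PDFs are left out because they need a third-party library.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the paragraph (w:p) and
# text-run (w:t) elements inside a .docx file.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def decode_text(raw_bytes, encoding="utf-8"):
    """Decode downloaded bytes with an explicit encoding.

    Hypothetical helper: the point is to choose the encoding yourself
    rather than rely on whatever default happens to apply.
    """
    return raw_bytes.decode(encoding)


def read_csv_rows(raw_bytes, encoding="utf-8"):
    """Parse CSV bytes into a list of dicts keyed by the header row."""
    text = decode_text(raw_bytes, encoding)
    # csv wants a file-like object, so wrap the string in StringIO.
    return list(csv.DictReader(io.StringIO(text)))


def docx_paragraphs(raw_bytes):
    """Return the text of each paragraph in a .docx file given as bytes.

    A .docx file is a zip archive; the body text lives in
    word/document.xml.
    """
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Join all text runs that belong to this paragraph.
        paragraphs.append("".join(t.text or "" for t in p.iter(W_NS + "t")))
    return paragraphs
```

In a real scraper the bytes would typically come from something like urllib.request.urlopen(url).read(). Notice that nothing here touches BeautifulSoup: plain text only needs decoding, and CSV has its own parser in the csv module.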

As for the rest of Mitchell’s book, which we will not cover in class, it includes details for scraping just about anything you might ever want to scrape. Even JavaScript files! Check out the remaining chapter titles and make a mental note of what you might need later on.

Appendix C covers ethics and legal prohibitions for Web scraping, which we touched on only briefly in class.

