Because we spent more time than I had expected on chapters 3, 4 and 5 in Mitchell, we will not be going over chapter 6 in class. This post serves to highlight some valuable parts of that chapter.
- Scraping plain-text files: Can be done, but in some ways it is harder than scraping HTML files. Don’t use BeautifulSoup with plain text, because BeautifulSoup is meant for HTML. It doesn’t do anything useful to a plain-text file.
- Character encoding: Great information here that’s pertinent to foreign languages and the meta charset tag you all know and love. Also pertinent to the utf8mb4_unicode_ci collation setting we learned about in chapter 5. I strongly recommend you read pages 94–98, because it will make you smarter about how the Web works!
- CSVs: Very important stuff if you are ever going to scrape and read any CSV files.
- PDFs: Yes, you can scrape data out of PDF files! There’s a Python library for that.
- Microsoft Word .docx files: The short version is, they suck. However, they can be scraped. Details in the chapter.
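To make the plain-text, encoding and CSV points a bit more concrete, here is a minimal sketch using only Python's standard library. It is not the code from the chapter. The function names (decode_text, read_csv_rows) and the sample data are made up for illustration, and it assumes the file arrives as raw bytes that are UTF-8 unless you say otherwise.

```python
import csv
import io


def decode_text(raw_bytes, encoding="utf-8"):
    """Decode raw bytes from a plain-text file into a str (hypothetical helper)."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw_bytes.decode(encoding, errors="replace")


def read_csv_rows(raw_bytes, encoding="utf-8"):
    """Decode CSV bytes and return a list of dicts keyed by the header row."""
    text = decode_text(raw_bytes, encoding)
    # StringIO wraps the string in a file-like object so csv can read it
    reader = csv.DictReader(io.StringIO(text, newline=""))
    return list(reader)


# Made-up sample data standing in for a downloaded CSV file
sample_bytes = "name,city\nZoë,Köln\nИван,Москва\n".encode("utf-8")

rows = read_csv_rows(sample_bytes)
for row in rows:
    print(row["name"], row["city"])
```

In a real scraper the bytes would probably come from something like urllib.request.urlopen(url).read() rather than a hard-coded string. If the accented or Cyrillic characters come out garbled, the usual cause is that the file was decoded with the wrong encoding, which is the problem the character encoding pages explain.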
Appendix C covers ethics and legal prohibitions for Web scraping, which we touched on only briefly in class.