Mitchell’s chapter 4 discusses APIs and how to use them to get information from websites (or, more accurately, Web applications) that offer them.
Before you try to scrape a site, a great first step is simply to search for the name of the site plus “API.” For example, I put this into Google:
new york times api
And I got this. Who knew?
Sometimes you get something even better than the API: You find that someone has already done essentially what you want to do and shared it with the world. This post is a good example of that:
Facebook does not give us much help for any kind of true scraping, even though it offers tons of APIs. The linked post won’t help you scrape other people’s Facebook Pages, but if you work for a publisher, you can use this technique to get raw data about your own Page(s), and you can use the raw data to do way more analysis than Facebook makes possible with its tools for publishers. This gives you super powers. Yay, code!
Things to try
Ask Google: what is my ip address
Copy that IP address into this URL, replacing the IP address that’s already there:
Paste the URL into your browser and view the tidy JSON data. (Remember JSON? You made a CSV of all your latlongs for map locations for the Leaflet assignment in Intro to Web Apps. You used Mr. Data Converter to change your CSV into JSON.)
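If you would rather read that JSON in Python than in the browser, a minimal sketch might look like this. The sample text and its field names (ip, country_code, city, latitude, longitude) are made up for illustration and are not the real response from the service. A real script would fetch the text from the URL first.

```python
import json

# Hypothetical sample of what an IP-lookup API might return.
# The field names and values are assumptions, not the real service's output.
sample_response = '''
{
  "ip": "203.0.113.7",
  "country_code": "US",
  "city": "Gainesville",
  "latitude": 29.65,
  "longitude": -82.32
}
'''

def summarize_location(json_text):
    """Parse a JSON string and return a short, readable summary."""
    data = json.loads(json_text)  # JSON object becomes a Python dictionary
    # .get() returns a default instead of raising KeyError if a field is missing
    city = data.get("city", "unknown city")
    country = data.get("country_code", "unknown country")
    return f"{data['ip']} is in {city}, {country}"

print(summarize_location(sample_response))
```

Once the JSON is a dictionary, you can pull out any value by its key, just as you would with a dictionary you wrote yourself.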
Here’s another cool thing Mitchell suggests in chapter 4: Create a free account at The Echo Nest and then explore what you can do in the links on the left side of this page. Then check out the Python libraries for Echo Nest.
Nice people have already written Python “wrappers” for many APIs. Use caution, however: Sometimes the publisher changes the API, and code written for the old version stops working. Sometimes it needs only a few tweaks, but sometimes it is no longer usable at all.
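To make the idea of a wrapper concrete, here is a minimal sketch. Everything in it is hypothetical: the class name, the base URL (api.example.com), the version label, and the parameter names are invented for illustration and do not belong to any real API. It only builds the request URL and does not contact any server.

```python
from urllib.parse import urlencode

class ExampleMusicAPI:
    """Hypothetical wrapper; no real service uses these URLs or parameters."""

    BASE_URL = "https://api.example.com"  # placeholder domain (assumption)
    VERSION = "v4"                        # assumed version label

    def __init__(self, api_key):
        self.api_key = api_key

    def build_url(self, endpoint, **params):
        """Return the full request URL for an endpoint and its parameters."""
        params["api_key"] = self.api_key
        # Sort the parameters so the same call always yields the same URL
        query = urlencode(sorted(params.items()))
        return f"{self.BASE_URL}/{self.VERSION}/{endpoint}?{query}"

api = ExampleMusicAPI("MY_KEY")
print(api.build_url("artist/search", name="Monty Python"))
```

Because the wrapper keeps the URL layout in one place, a change by the publisher (a new version label, a renamed parameter) breaks every script that relies on it until that one method is updated.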
Starting to use any API takes some time and effort. The payoff can be great, however, because learning to use the API will usually save you tons of time in the long run, as compared with writing code from scratch.