The Internet abounds with data that are stored and exchanged in many different formats. In this chapter our aim has been to give you a flavor of the variety of formats available and a basic understanding of how to acquire data from online sources and services. We have also addressed the important goal of acquiring data in a reproducible fashion. Rather than copy and paste from a web page or complete a form by hand, we have demonstrated how to write code to acquire data. This code gives you a record of your workflow and of the data provenance.
With each format introduced, we described a model for its structure. A basic understanding of a dataset’s organization helps you uncover issues with quality, catch mistakes in reading a source file, and decide how best to wrangle and analyze the data. In the longer run, as you continue to develop your data science skills, you will be exposed to other forms of data exchange, and we expect that this approach of considering the organizational model and getting your hands dirty with some simple cases will serve you well.
Web etiquette is a topic that we must mention. If you plan to scrape data from a website, it’s a good idea to check that you have permission to do so. When we sign up to be a client for a web app, we typically check a box indicating our agreement to the terms of service.
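One programmatic courtesy check, which this chapter did not cover and which does not replace reading the terms of service, is to consult a site's robots.txt file. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The rules, the `example.com` address, the `my-data-project` agent name, and the helper `allowed` are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, supplied as text so the sketch runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(path, agent="my-data-project"):
    """Report whether the parsed rules permit `agent` to fetch `path`."""
    # example.com is a placeholder host, not a site from this chapter.
    return parser.can_fetch(agent, "https://example.com" + path)


print(allowed("/data/table.html"))
print(allowed("/private/table.html"))
```

For a live site, you would instead point the parser at the site's own file with `parser.set_url(...)` followed by `parser.read()`, which makes one small request.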
If you use a web service or scrape web pages, be careful not to overburden the site with your requests. If a site offers a version of the data in a format like CSV, JSON, or XML, then it’s better to download and use that file than to scrape the data from a web page. Likewise, if there is a Python library that provides structured access to a web app, then use it rather than write your own code. When you make requests, start small to test your code, and consider saving the results so you don’t have to repeat requests unnecessarily.
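The advice to save results and avoid repeated requests can be sketched as a small caching helper. This is one possible approach, not the only one: the function names `download` and `fetch_cached`, the `cache` directory, and the one-second pause are assumptions chosen for illustration.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def download(url):
    """Fetch the raw bytes at `url` with the standard library."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def fetch_cached(url, cache_dir="cache", fetch=download, delay=1.0):
    """Return the bytes for `url`, contacting the site only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Name the cache file after a hash of the URL so any URL gives a safe filename.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()  # saved earlier, so no request is made
    time.sleep(delay)  # pause before each real request to go easy on the site
    content = fetch(url)
    path.write_bytes(content)
    return content
```

Calling `fetch_cached` twice with the same URL makes only one request; the second call reads the saved file. To start small, try it on one or two URLs and inspect the saved files before looping over a longer list.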
The aim of this chapter isn’t to make you an expert in these specific data formats. Instead, we wanted to give you the confidence needed to learn more about a data format, to evaluate the pros and cons of different formats, and to participate in projects that might use formats that you haven’t seen before.