14.6. Summary
The internet abounds with data that are stored and exchanged in many different formats. In this chapter, our aim was to give you a flavor of the variety of formats available and a basic understanding of how to acquire data from online sources and services. We also addressed the important goal of acquiring data in a reproducible fashion. Rather than copying and pasting from a web page or completing a form by hand, we demonstrated how to write code to acquire data. This code gives you a record of your workflow and of the data provenance.
With each format introduced, we described a model for its structure. A basic understanding of a dataset’s organization helps you uncover quality issues, catch mistakes made in reading a source file, and decide how best to wrangle and analyze the data. In the longer run, as you continue to develop your data science skills, you will be exposed to other forms of data exchange, and we expect that this approach of considering the organizational model and getting your hands dirty with some simple cases will serve you well.
We only scratched the surface of web services. There are many other useful topics, such as keeping a connection to a server alive as you issue multiple requests or retrieve data in batches, using cookies, and making multiple connections. But understanding the basics presented here can get you a long way. For example, if you use a library to retrieve data from an API but run into an error, you can start looking at the HTTP requests to debug your code. And you will know what’s possible when a new web service comes online.
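As a minimal sketch of what "looking at the HTTP requests" can mean in practice, the code below builds a GET request without sending it and prints the method, the full URL, and the headers that would go to the server. It uses Python's standard `urllib` module. The endpoint `https://api.example.com/v1/measurements`, its parameters, and the helper names `build_request` and `describe_request` are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, params, headers=None):
    """Build (but do not send) a GET request so that it can be inspected."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers=headers or {}, method="GET")


def describe_request(req):
    """Collect the method, URL, and headers of a request as plain Python data."""
    return {
        "method": req.get_method(),
        "url": req.full_url,
        "headers": dict(req.header_items()),
    }


# Hypothetical endpoint and query parameters, for illustration only.
req = build_request(
    "https://api.example.com/v1/measurements",
    {"city": "Berkeley", "limit": 5},
    headers={"Accept": "application/json"},
)

summary = describe_request(req)
print(summary["method"], summary["url"])
for name, value in summary["headers"].items():
    print(f"{name}: {value}")
```

Comparing this output with the request that the service's documentation describes is often a quick way to spot a misspelled parameter or a missing header before suspecting the library itself.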
Web etiquette is a topic that we must mention. If you plan to scrape data from a website, it’s a good idea to check that you have permission to do so. When we sign up to be a client for a web app, we typically check a box indicating our agreement to the terms of service. Those terms often say whether and how data may be collected from the site, so it’s worth reading them before you start.
If you use a web service or scrape web pages, be careful not to overburden the site with your requests. If a site offers a version of the data in a format like CSV, JSON, or XML, it’s better to download and use that version than to scrape the data from a web page. Likewise, if there is a Python library that provides structured access to a web app, use it rather than writing your own code. When you make requests, start small to test your code, and consider saving the results so that you don’t have to repeat requests unnecessarily.
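One way to follow this advice is to save each response to disk the first time you retrieve it and to pause briefly before every real request. The sketch below is one possible approach, not a prescribed one. The function names `fetch_url` and `cached_fetch`, the cache file naming scheme, and the one-second default delay are our own assumptions.

```python
import hashlib
import time
from pathlib import Path
from urllib.request import urlopen


def fetch_url(url):
    """Download the body of `url` as text. This makes a real network request."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


def cached_fetch(url, cache_dir, fetch=fetch_url, delay=1.0):
    """Return the body for `url`, reading it from disk if it was saved earlier."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Name the cache file after a hash of the URL so that any URL
    # maps to a valid filename.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.txt"

    if path.exists():
        return path.read_text(encoding="utf-8")

    # Pause before each real request to avoid overburdening the server.
    time.sleep(delay)
    body = fetch(url)
    path.write_text(body, encoding="utf-8")
    return body
```

To start small, you might call `cached_fetch` on only the first two or three URLs in your list, check the saved files, and then run it on the rest. Rerunning the code later reads from the cache directory instead of contacting the site again. The `fetch` argument lets you swap in a stand-in function while testing, so that no request is sent at all.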
The aim of this chapter wasn’t to make you an expert in these specific data formats. Instead, we wanted to give you the confidence needed to learn more about a data format, to evaluate the pros and cons of different formats, and to participate in projects that might use formats that you haven’t seen before.
Now that you have experience working with different data formats, we return to the topic of modeling that we introduced in Chapter 4, picking it back up in earnest.