13. Working with Text

A lot of data comes not as numbers but as text: names of dog breeds, restaurant violation descriptions, street addresses, web logs, books, documents, blog posts, and Internet comments. To organize and analyze the information contained in text, we often need to perform the following tasks:

  • Convert text into a standard format. This is also referred to as canonicalizing text. For example, we might need to convert characters to lowercase, use common spellings and abbreviations, or remove punctuation and blanks.

  • Extract a piece of text to create a feature. For example, a string might contain an embedded date that we want to pull out to create a date feature.

  • Transform text into a feature. We might want to create a 0-1 variable to indicate whether or not a string contains certain words.

  • Analyze text. To compare entire documents at once, we can transform each document into a vector of word counts.
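As a rough preview of these four tasks, here is a minimal sketch that uses only the Python standard library. The function names, the sample strings, and the assumption that dates appear in YYYY-MM-DD form are all made up for illustration; they are not taken from the datasets used later in the chapter.

```python
import re
import string
from collections import Counter


def canonicalize(text):
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def extract_date(text):
    """Return the first YYYY-MM-DD substring, or None if there is none."""
    match = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return match.group() if match else None


def contains_word(text, word):
    """Return 1 if the canonicalized text contains the word, else 0."""
    return int(word in canonicalize(text).split())


def word_counts(text):
    """Count how often each word appears in the canonicalized text."""
    return Counter(canonicalize(text).split())


# Hypothetical examples, invented for this sketch.
description = "Unclean or degraded floors, walls,  or ceilings!"
log_line = "Inspection on 2016-05-03 found 2 violations"

print(canonicalize(description))             # unclean or degraded floors walls or ceilings
print(extract_date(log_line))                # 2016-05-03
print(contains_word(description, "floors"))  # 1
print(word_counts(description)["or"])        # 2
```

Note that the order of operations matters: removing punctuation before looking for a date would strip the hyphens that the date pattern relies on, so the sketch extracts the date from the raw string.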

This chapter introduces common techniques for working with text data. We show how simple string manipulation tools are often all we need to put text in a standard form or extract portions of strings. We’ll also introduce regular expressions for more general and robust pattern matching. To make these text operations more concrete, we begin by introducing several examples that need string manipulations or regular expressions to prepare text for analysis.