1.4.3. Data Wrangling

The data we collect (as in Section 1.2) may not be in a usable form. It may contain errors, missing values, and other problems that need to be cleaned up before the data can be organized (as in Section 1.3.2). The process of cleaning and organizing data is sometimes called data wrangling.
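
To make the idea of "errors" and "missing values" concrete, here is a minimal sketch of a first cleaning pass in plain Python. The records, the names, and the plausibility range for ages are all made up for illustration; they do not come from the readings, and a real project would more likely use a data library.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": None},  # missing value
    {"name": "Cy", "age": -4},     # impossible value, likely a data-entry error
    {"name": "Dee", "age": 51},
]

# Assumed plausibility range for a human age; adjust for your own data.
MIN_AGE, MAX_AGE = 0, 120


def check_age(age):
    """Return a label describing the problem with an age, or 'ok'."""
    if age is None:
        return "missing"
    if not MIN_AGE <= age <= MAX_AGE:
        return "implausible"
    return "ok"


# Label every record so problems are visible before anything is changed.
problems = {r["name"]: check_age(r["age"]) for r in records}

# One simple (and lossy) choice: keep only the rows that passed the check.
clean = [r for r in records if check_age(r["age"]) == "ok"]

print(problems)
print(clean)
```

Dropping the problem rows is only one option, and often not the best one. The readings discuss alternatives.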

Our first reading, from Learning Data Science [LGN23], walks through the data wrangling process.

Reading Questions

  • If one bit of data is missing from a row (say, a person’s age), what can you do about it?

  • What is imputation?

Our second set of readings is about outliers, which are data points that seem very different from the rest.
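
One way to turn "very different from the rest" into something checkable is the 1.5 × IQR rule, a common convention that is not necessarily the one the readings use. The sketch below applies it to made-up income figures using only Python's standard library. Note that it only flags candidate outliers; deciding what to do with them is a separate question.

```python
import statistics

# Hypothetical incomes in thousands of dollars; not real data.
incomes = [32, 41, 38, 45, 36, 40, 4500]

# Quartiles from the standard library (default "exclusive" method).
q1, _median, q3 = statistics.quantiles(incomes, n=4)
iqr = q3 - q1

# Conventional fences: anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in incomes if x < lower_fence or x > upper_fence]

print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"flagged: {outliers}")
```

A flagged value is not automatically wrong. It may be a typo, or it may be a real and important observation.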

Reading Questions

  • If your data file says that someone’s age is 237, what should you do?

  • If your data file says that someone’s income is 100 times the next largest income, what should you do?

  • What is so special about C-3PO?