1.4.3. Data Wrangling
The data we collect (as in Section 1.2) may not be in a usable form. It may contain errors, missing values, and other problems that must be cleaned up before the data can be organized (as in Section 1.3.2). The process of cleaning and organizing data is sometimes called data wrangling.
Our first reading, from Learning Data Science [LGN23], walks through this process of cleaning and organizing data.
Reading Questions
If one bit of data is missing from a row (say, a person’s age), what can you do about it?
What is imputation?
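To make the questions above concrete, here is a minimal sketch of two common ways to handle a missing age, assuming the data sits in a pandas DataFrame. The table, its column names, and the choice of the median as the fill value are all made up for illustration; the reading discusses when each approach is appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the names and values are invented for illustration.
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Di"],
    "age": [34.0, np.nan, 29.0, 51.0],
})

# Option 1: drop every row that has a missing value.
dropped = people.dropna()

# Option 2: impute, i.e. fill the gap with a plausible value.
# The median is one possible choice, not the only one.
median_age = people["age"].median()
imputed = people.assign(
    age_was_missing=people["age"].isna(),   # remember which values were filled in
    age=people["age"].fillna(median_age),
)

print(dropped)
print(imputed)
```

Dropping rows is simple but discards the other information in the row. Imputing keeps the row but inserts a value that was never observed, which is why the sketch records which entries were filled in.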
Our second set of readings is about outliers, which are points in the data that seem very different from the rest.
From Computational and Inferential Thinking [ADW21]:
7. Visualization, in the section Scatter Plots, gives an example in which one actor's average gross income per movie is exceptionally high.
15.1.7. Correlation is Affected by Outliers shows how correlation (which we will see in Section 3) can be affected by outliers.
From Learning Data Science [LGN23]:
10.3.2. One Qualitative and One Quantitative Variable illustrates how (suspected) outliers are shown as dots on box plots.
11.1.1. Filling the Data Region illustrates how treating outliers separately yields better visualizations.
18.2. Wrangling and Transforming shows how to remove outliers (outlying donkeys) in preparation for modeling the relationships between variables.
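As a rough illustration of two ideas from these readings, the sketch below flags suspected outliers with the 1.5 × IQR rule commonly used to draw the dots on box plots, then compares the correlation with and without the flagged point. The data, the column names, and the helper `iqr_outliers` are hypothetical and not taken from either book. Flagging a point is only a prompt to investigate it, not a decision to remove it.

```python
import pandas as pd

# Hypothetical data: hours studied vs. exam score, with a suspicious last row.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8],
    "score": [52, 55, 61, 64, 70, 73, 79, 5],
})

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

suspect = iqr_outliers(df["score"])

# Correlation using every row, then using only the rows that were not flagged.
r_all = df["hours"].corr(df["score"])
r_trimmed = df.loc[~suspect, "hours"].corr(df.loc[~suspect, "score"])

print(df[suspect])
print(f"r with all points: {r_all:.3f}")
print(f"r without flagged points: {r_trimmed:.3f}")
```

In this made-up example a single point is enough to turn a strong positive correlation into a negative one, which is the kind of effect section 15.1.7 describes.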
Reading Questions
If your data file says that someone’s age is 237, what should you do?
If your data file says that someone’s income is 100 times the next largest income, what should you do?
What is so special about C-3PO?
Further Resources
From the Python Data Science Handbook [Van16]