1.2. Collecting Data

Learning Outcome

Students will be able to acquire raw data from a variety of sources.

Sample Tasks:

  • Distinguish between different sources of data, such as relational databases, automated data collection, and online surveys.

  • Distinguish between structured data sources (searchable sources such as relational databases) and unstructured data sources (sources that are not searchable, such as social media and text messages).

  • Collect data from open or public data sources such as data.gov, IPUMS, Kaggle, Quandl, The World Bank, US Census Bureau, NASA, Amazon Web Services, or Google Cloud Platform.

  • Convert a file from its present format into a format that is prepared for analysis.
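
As a rough illustration of the last task, the sketch below converts a small JSON array of records into CSV text using only Python's standard library. The function name `json_records_to_csv` and the sample records are made up for illustration. It assumes the JSON is a flat list of objects, which real data often is not.

```python
import csv
import io
import json


def json_records_to_csv(json_text: str) -> str:
    """Convert a JSON array of flat objects into CSV text (illustrative helper)."""
    records = json.loads(json_text)

    # Collect column names in first-seen order so that no field is dropped
    # when records do not all share the same keys.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


# Hypothetical sample data, invented for this example.
sample = '[{"city": "Avon", "population": 1200}, {"city": "Brook", "area_km2": 3.5}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Nested JSON, inconsistent types, or very large files would need more care than this sketch takes. A data analysis library may be a better fit in those cases.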


Our first reading, from Learning Data Science [LGN23], gives some examples of different types of data available and the formats they might be in.

Reading Questions

  • What does CSV stand for?

  • Why doesn’t everyone use the same file format?

Rather than reading more text on different data formats, spend a few minutes (or a few hours) exploring what is available on the internet. Some places to look:

Reading Questions

  • What was the most common file format you found?

  • What was the strangest file format or type of data you found?

  • What was the most interesting data you found?