Introduction

Introductory Data Science (IDS) is an emerging field that uses methods from statistics, mathematics, and computer science to find and communicate meaning in data. Due to the rapidly increasing role of data in commerce and science, data science has gained wide attention in a relatively short amount of time in the private sector, government, and academia. In the private sector, job advertisements for data scientists have proliferated. ….

The goal of this course is not to transform each student into a data scientist, but to give the student a sense of data literacy. That is, the ability to collect, analyze and derive meaningful information from data. This course does not require prior mathematical, statistical, or programming skills. ….

Topic Areas:

  1. Curation of Data – Data management and curation is a first step in IDS. This includes acquiring data from diverse formats and structures, cleaning data to prepare for data analysis and maintaining and sharing data files in a version control system.

  2. Enhanced Data Visualization – Statistics and IDS often tell a story of data through informative pictures. Enhanced data visualizations go beyond the common graphs in introductory statistics to describe, explore and communicate insights from data.

  3. Statistical Models, Estimation, and Prediction – Determining a functional relationship between numerical data and numerical/categorical data is at the center of statistical modeling in IDS. The focus is on using statistical models to describe relationships between variables, discerning between modeling for predictions and modeling for inference, and fitting, evaluating, and interpreting statistical models. Students familiar with statistical inference from introductory statistics are unavoidably sheltered from the mathematical rigor behind each hypothesis test. Simulation based probability and inference provides straightforward methods to make decisions with data without working through the mathematical details that underlie traditional statistical inference. The quality of a prediction model in terms of being able to forecast an expected outcome is measured through a loss function.

  4. Applications of Data Science – Machine learning and statistical learning. Machine learning is the process of developing computer algorithms to search for and recognize patterns in data. Statistical learning additionally uses the expertise of a human analyst to craft appropriate models. There are two branches in machine/statistical learning: supervised learning and unsupervised learning.

  5. Consumer of Data Science – The importance of ethical data science practice is critical to the validity of results in any IDS course. This includes problem-solving and the use of ‘big data’ which can accentuate cultural biases and differences and discussion of issues involving privacy, data security and societal impact.

[OhioDoHEducation21]
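Two ideas from Topic Area 3 can be made concrete in a few lines of Python: a loss function that scores predictions, and a simulation-based test that reaches a decision by shuffling data instead of using a formula. The sketch below is only an illustration. The function names (mean_squared_error, permutation_test) and all data values are invented for this example and do not come from TMM026. Mean squared error is just one common choice of loss function.

```python
import random
from statistics import mean


def mean_squared_error(actual, predicted):
    """One common loss function: the average squared prediction error."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def permutation_test(group_a, group_b, trials=2000, seed=0):
    """Estimate how often randomly shuffled groups differ in their means
    by at least as much as the observed groups do."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    count = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            count += 1
    return count / trials


# Hypothetical data, invented for illustration.
actual = [3.0, 5.0, 7.0]
good_guess = [3.0, 5.0, 6.0]
poor_guess = [0.0, 0.0, 0.0]
loss_good = mean_squared_error(actual, good_guess)
loss_poor = mean_squared_error(actual, poor_guess)
print("loss of good guess:", loss_good)
print("loss of poor guess:", loss_poor)

treatment = [12, 14, 15, 16, 18]
control = [8, 9, 10, 11, 12]
p_value = permutation_test(treatment, control)
print("estimated p-value:", p_value)
```

A smaller loss means the predictions are closer to the actual values. A small estimated p-value means that random shuffling rarely produces a difference as large as the observed one, which is evidence that the two groups really differ.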

About this book

This book is designed to satisfy TMM026-Introductory Data Science [OhioDoHEducation21], which is the Ohio Department of Higher Education’s standard for transferability under Ohio Transfer 36. Each section of this book corresponds to the TMM026 Learning Outcome with the same number. Sections 3.4, 3.5, 3.6, and 4.4 are considered optional under TMM026 and are marked with a * in their titles.

Typically, someone writes a book because either

  • they think they have something new and important to say, or

  • they think they can explain the material better than existing books do.

Neither is true of this book. Instead, the goal is to organize existing high-quality materials to support a course satisfying TMM026. We rely on publicly available material from several sources and would particularly like to acknowledge [ADW21, BKH21, LGN23].

What is Data Science and why should you care?

As stated above, others have already explained this well, so we will use what they have written.

Our first reading, from Computational and Inferential Thinking [ADW21], gives a quick explanation of what data science is and why it matters.

Our second reading, from Learning Data Science [LGN23], walks through the big picture of what one does in data science.

Programming Language and Libraries

We use Python as the programming language, but do not assume the reader has any prior knowledge of Python or any programming experience. Some of the suggested readings contain examples using the R language, but the reader does not need to understand how those work.

As much as possible, we use the Python datascience library, which was created for Data 8: The Foundations of Data Science, a course at the University of California, Berkeley.

The datascience package is an open source Python package that helps make programming more accessible to all students, regardless of background. As a pedagogical aid, the package is designed to help students more intuitively conduct data science techniques without first spending considerable time directly learning more complex tools such as pandas or matplotlib. At Berkeley, these other packages are introduced in further upper-division coursework such as Data 100.


When needed, we also use the Python pandas library. The pandas data analysis library is heavily used in practice, so several of our sources describe data science concepts using pandas code.
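For readers who have never seen pandas code, here is a minimal sketch of what it tends to look like. The table of flowers and every variable name are invented for illustration. The example assumes pandas is installed and is not drawn from any of the cited sources.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
flowers = pd.DataFrame({
    "name": ["rose", "tulip", "daisy", "lily"],
    "petals": [5, 6, 34, 6],
})

# Keep only the rows where the petal count is above 5.
many_petals = flowers[flowers["petals"] > 5]

# Sort those rows from the most petals to the fewest.
ranked = many_petals.sort_values("petals", ascending=False)

# Average petal count across all four flowers.
average_petals = flowers["petals"].mean()

print(ranked)
print("average petals:", average_petals)
```

Each step takes a table and produces a new table or a number, which is the same pattern the datascience library follows with its own method names.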

Acknowledgements

  • Shadrack Afful Mensah contributed by evaluating material from [Beu21] and [LGN23] for inclusion.

  • Moayad Odeh contributed by evaluating material from [Van16] for inclusion.

  • Gordon Nsiah contributed by finding materials to include in Section 5.

Bibliography

ADW21

Ani Adhikari, John DeNero, and David Wagner. Computational and Inferential Thinking: The Foundations of Data Science. Self-published, 2021. URL: https://inferentialthinking.com/.

BKH21

Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton. Modern Data Science with R. Self-published, 2nd edition, 2021. URL: https://mdsr-book.github.io/mdsr2e/.

Beu21

Tomas Beuzen. Python Programming for Data Science. Self-published, 2021. URL: https://www.tomasbeuzen.com/python-programming-for-data-science/README.html.

LGN23

Sam Lau, Joey Gonzalez, and Deb Nolan. Learning Data Science. Self-published, 2023. To be published with O'Reilly Media in 2023. URL: http://www.textbook.ds100.org/.

Van16

Jake VanderPlas. Python Data Science Handbook. O'Reilly Media, Inc., 2016. URL: https://jakevdp.github.io/PythonDataScienceHandbook/.

OhioDoHEducation21

Ohio Department of Higher Education. TMM026-Introductory Data Science. December 2021. URL: https://www.ohiohighered.org/sites/default/files/uploads/transfer/policy/Introductory%20to%20Data%20Science%20Learning%20Outcomes%20%2812.3.21%29.pdf.