1.4.2. Introduction to Pandas

As mentioned in Programming Language and Libraries, the datascience library is designed as an educational tool to help students learn data science concepts. The text Computational and Inferential Thinking [ADW21] uses this library. As much as possible, we use this text for our readings, but it does not cover everything that we need.

Outside the classroom, the pandas Python data analysis library is used instead. Consequently, the other texts we use are based on pandas, as are the plotting libraries and other libraries that we will need. So:

  • You will need to know how to convert a datascience.Table into a pandas object so that other libraries can use it.

  • You will need to know enough pandas to understand the readings that use it.

The datascience library includes conversion methods:

  • Table.from_df converts a pandas.DataFrame into a datascience.Table, as in tbl=Table.from_df(df).

  • Table.to_df converts a datascience.Table into a pandas.DataFrame, as in df=tbl.to_df().

You can load pandas into a Python session with:

import pandas as pd

Our readings are from Learning Data Science [LGN23] and mostly teach us how to use pandas to do the things we already learned to do with datascience in Section 1.3.2.

Reading Questions

  • If mydata is a pandas DataFrame, then here are some things we can do with it:

    • mydata['Hometown']

    • mydata[mydata['Score'] <= 20]

    • mydata['Major'] = ['Art','Biology','Criminology','Math']

    • mydata.groupby('Hometown')['Score'].sum()

    • mydata[['Hometown','Major']]

    • mydata.join(majorcodes,on='Major',how='inner')

    • mydata['Hometown'].apply(len)

    • mydata.sort_values('Hometown',ascending=False)

    Now suppose mydata is a datascience Table. How would you accomplish the same things?

  • If mydata is a datascience Table, then we have learned the following methods:

    • mydata.column("Name")

    • mydata.with_column("Age",agearray)

    • mydata.select("Name","Favorite Color")

    • mydata.sort("Age")

    • mydata.where("Age",are.above(30))

    • mydata.apply(abs,"Age")

    • mydata.group("Hometown")

    • mydata.join("Hometown",stateindex,"City")

    Now suppose mydata is a pandas DataFrame. How would you accomplish the same things?
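To experiment with the pandas expressions listed in the first question, it may help to have a small DataFrame to run them on. The sketch below builds one; the values in mydata and the majorcodes lookup table (including its Code column) are hypothetical and exist only so that each expression has something to act on. Note that DataFrame.join matches the on column against the index of the other DataFrame, so majorcodes is indexed by major name here.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
mydata = pd.DataFrame({
    "Hometown": ["Athens", "Columbus", "Athens", "Dayton"],
    "Score": [15, 25, 20, 30],
})

hometowns = mydata["Hometown"]                # a single column (a Series)
low_scores = mydata[mydata["Score"] <= 20]    # rows with Score at most 20
mydata["Major"] = ["Art", "Biology", "Criminology", "Math"]  # new column
totals = mydata.groupby("Hometown")["Score"].sum()  # total Score per Hometown
subset = mydata[["Hometown", "Major"]]        # two columns (a DataFrame)

# Hypothetical lookup table, indexed by major name so that join can match
# mydata's Major column against this index.
majorcodes = pd.DataFrame(
    {"Code": ["ART", "BIOL", "MATH"]},
    index=["Art", "Biology", "Math"],
)
# An inner join keeps only rows whose Major appears in majorcodes.
joined = mydata.join(majorcodes, on="Major", how="inner")

name_lengths = mydata["Hometown"].apply(len)  # len applied to each value
by_town_desc = mydata.sort_values("Hometown", ascending=False)
```

Printing each result and comparing it with mydata is a reasonable way to check your understanding before working out the datascience equivalents.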