1.4.2. Introduction to Pandas
As mentioned in the introduction in Programming Language and Libraries, the datascience library is designed as an educational tool, to help students learn data science concepts.
The text Computational and Inferential Thinking [ADW21] uses this library.
As much as possible, we use this text for our readings, but it does not have everything that we need.
Out in the real world, the pandas Python data analysis library is used instead.
Consequently, the other texts we use are based on pandas, as are plotting libraries and other libraries that we will need.
So:
You will need to know how to convert a datascience.Table into a pandas object so that other libraries can use it.
You will need to know enough pandas to understand the readings that use it.
The datascience library includes conversion methods:
Table.from_df converts a pandas.DataFrame into a datascience.Table, as in tbl = Table.from_df(df).
Table.to_df converts a datascience.Table into a pandas.DataFrame, as in df = tbl.to_df().
You can load pandas into a Python session with:
import pandas as pd
Our readings are from Learning Data Science [LGN23] and mostly teach us how to use pandas to do the things we already learned to do with datascience in Section 1.3.2.
Reading Questions
If mydata is a pandas DataFrame, then here are some things we can do with it:
    mydata['Hometown']
    mydata[mydata['Score'] <= 20]
    mydata['Major'] = ['Art','Biology','Criminology','Math']
    mydata.groupby('Hometown')['Score'].sum()
    mydata[['Hometown','Major']]
    mydata.join(majorcodes,on='Major',how='inner')
    mydata['Hometown'].apply(len)
    mydata.sort_values('Hometown',ascending=False)
Now suppose mydata is a datascience Table. How would you accomplish the same things?

If mydata is a datascience Table, then we have learned the following methods:
    mydata.column("Name")
    mydata.with_column("Age",agearray)
    mydata.select("Name","Favorite Color")
    mydata.sort("Age")
    mydata.where("Age",are.above(30))
    mydata.apply(abs,"Age")
    mydata.group("Hometown")
    mydata.join("Hometown",stateindex,"City")
Now suppose mydata is a pandas DataFrame. How would you accomplish the same things?
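To experiment with the pandas expressions from the first reading question, it may help to run them on a small DataFrame. The sketch below uses made-up data; the values in mydata and the contents of majorcodes (including its Code column) are assumptions chosen only so that every expression runs.

```python
import pandas as pd

# Hypothetical data, invented only for this illustration.
mydata = pd.DataFrame({
    "Hometown": ["Austin", "Boise", "Austin", "Chicago"],
    "Score": [15, 25, 20, 30],
})

hometowns = mydata["Hometown"]               # one column, returned as a Series
low_scores = mydata[mydata["Score"] <= 20]   # rows where the condition is True
mydata["Major"] = ["Art", "Biology", "Criminology", "Math"]  # add a new column
totals = mydata.groupby("Hometown")["Score"].sum()  # total Score per Hometown
subset = mydata[["Hometown", "Major"]]       # keep only the listed columns

# DataFrame.join matches mydata['Major'] against the *index* of the other
# frame, so this hypothetical majorcodes is indexed by major name.
majorcodes = pd.DataFrame(
    {"Code": [101, 202, 303]},
    index=["Art", "Biology", "Math"],
)
# how='inner' keeps only rows whose Major appears in majorcodes.
joined = mydata.join(majorcodes, on="Major", how="inner")

name_lengths = mydata["Hometown"].apply(len)  # call len on each entry
by_town_desc = mydata.sort_values("Hometown", ascending=False)  # Z to A

print(totals)
print(joined)
```

Printing each intermediate result is a quick way to see which expressions return a single column (a Series) and which return a whole DataFrame.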
Further Resources
From the Python Data Science Handbook [Van16]
From the pandas project: