1.4.2. Introduction to Pandas
As mentioned in the introduction in Programming Language and Libraries, the datascience
library is designed as an educational tool, to help students learn data science concepts.
The text Computational and Inferential Thinking [ADW21] uses this library.
As much as possible, we use this text for our readings, but it does not have everything that we need.
Out in the real world, the pandas Python data analysis library is used instead.
Consequently, the other texts we use are based on pandas, as are plotting libraries and other libraries that we will need.
So:
You will need to know how to convert a datascience.Table into a pandas object so that other libraries can use it.
You will need to know enough pandas to understand the readings that use it.
The datascience library includes conversion methods:
Table.from_df converts a pandas.DataFrame into a datascience.Table, as in tbl = Table.from_df(df).
Table.to_df converts a datascience.Table into a pandas.DataFrame, as in df = tbl.to_df().
You can load pandas into a Python session with
import pandas as pd
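As a quick check that the import worked, a small DataFrame can be built from a dictionary of columns and inspected. This is only a sketch: the table name scores and its contents are made up for illustration and do not come from the readings.

```python
import pandas as pd

# Hypothetical data, invented for this illustration only.
scores = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cal"],
    "Score": [15, 25, 20],
})

print(scores.shape)          # (number of rows, number of columns)
print(list(scores.columns))  # the column labels
print(scores.head())         # the first few rows
```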
Our readings are from Learning Data Science [LGN23] and mostly teach us how to use pandas to do the things we already learned to do with datascience in Section 1.3.2.
Reading Questions
If mydata is a pandas DataFrame, then here are some things we can do with it:
mydata['Hometown']
mydata[mydata['Score'] <= 20]
mydata['Major'] = ['Art','Biology','Criminology','Math']
mydata.groupby('Hometown')['Score'].sum()
mydata[['Hometown','Major']]
mydata.join(majorcodes,on='Major',how='inner')
mydata['Hometown'].apply(len)
mydata.sort_values('Hometown',ascending=False)
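To experiment with the operations listed above, it may help to run them on a tiny table. The sketch below assumes made-up contents for mydata and for the lookup table majorcodes (indexed by major name, with a hypothetical Code column); neither is defined in the readings, so treat the values as placeholders.

```python
import pandas as pd

# Hypothetical data, invented for this illustration only.
mydata = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cal", "Dee"],
    "Hometown": ["Troy", "Albany", "Troy", "Utica"],
    "Score": [15, 25, 20, 30],
})

# Select one column (the result is a pandas Series).
hometowns = mydata["Hometown"]

# Keep only the rows whose Score is at most 20.
low_scores = mydata[mydata["Score"] <= 20]

# Add a new column; the list needs one entry per row.
mydata["Major"] = ["Art", "Biology", "Criminology", "Math"]

# Total Score for each Hometown.
totals = mydata.groupby("Hometown")["Score"].sum()

# Select several columns (the result is a DataFrame).
subset = mydata[["Hometown", "Major"]]

# Hypothetical lookup table whose index holds the major names.
majorcodes = pd.DataFrame(
    {"Code": [101, 202, 303]},
    index=["Art", "Biology", "Math"],
)

# Match the Major column against the index of majorcodes.
# With how='inner', rows with no match (Criminology here) are dropped.
joined = mydata.join(majorcodes, on="Major", how="inner")

# Apply a function to every entry of a column.
name_lengths = mydata["Hometown"].apply(len)

# Sort rows by Hometown in reverse alphabetical order.
by_hometown = mydata.sort_values("Hometown", ascending=False)

print(totals)
print(joined)
```

Note that join with on='Major' looks up each value of the Major column in the index of the other table, which is why majorcodes is built with the major names as its index.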
Now suppose mydata is a datascience Table. How would you accomplish the same things?
If mydata is a datascience Table, then we have learned the following methods:
mydata.column("Name")
mydata.with_column("Age",agearray)
mydata.select("Name","Favorite Color")
mydata.sort("Age")
mydata.where("Age",are_above(30))
mydata.apply(abs,"Age")
mydata.group("Hometown")
mydata.join("Hometown",stateindex,"City")
Now suppose mydata is a pandas DataFrame. How would you accomplish the same things?
Further Resources
From the Python Data Science Handbook [Van16]
From the pandas project: