Data Wrangling 101: Pandas in Action

📓 Notes:

https://github.com/ua-datalab/Workshops/wiki/Data-Wrangling-101:-Pandas-in-Action

💻 Code: https://github.com/ua-datalab/Workshops/blob/main/Data_Wrangling_101_Pandas_in_Action/Data_Wrangling_101_Pandas_in_Action.ipynb

📹 Link to Zoom recording (10-10-2023)

Pandas Library

(Image Credit: Wikimedia Commons, CC)

Pandas is a Python Library designed by Wes McKinney for data manipulation and data analysis. Pandas basic data structures objects are 1-dimensional Series and 2-dimensional DataFrame.

Pandas can read a wide variety of data formats, such as comma separated values (csv), Microsoft Excel files, JSON, SQL tables and queries, and more. Pandas is a base tool for data cleansing and data wrangling.

Learning Objectives

Introduction to Pandas data structures
How to create a DataFrame data structure
Reading input data files into a DataFrame
DataFrame indexing
Working with text and categorial variables
Basic operations on a DataFrame
How to select rows or columns from a DataFrame
How to add or delete rows or columns in a DataFrame
How to merge, join, concatenate and compare DataFrames

Please see Slides: tinyurl.com/UAD7-PythonDataAnalysis

The Pandas Python Library was developed in 2008 by Wes McKinney, for performing data manipulation and analysis. Pandas uses data structures and functions for manipulating numerical tables and time series.

What is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

dataframe

We will again use Google Colab where the Pandas is included in the available software libraries like NumPy and Matplotlib used in the previous week.

Some basic operations on a Pandas DataFrame df.

Function	Action
pd.read_csv(filename)	Reads a CSV (comma separated values) file
pd.read_excel(filename)	Read an Excel file
	Reading other file formats
pd.to_csv(filename)	Write the dataframe to a file
df.head()	Shows the first 5 rows by default. If you wish to print n rows use df.head(n)
df.tail()	Shows the last 5 rows by default
df.shape	Prints the dataframe dimensions (rows, columns)
df.info()	Prints out dataframe information: number of rows, columns, names, data types, number of non null entries and more
df.describe()	Returns a statistical analysis of float variables
df['categorical variable'].describe()	Describes how many they are and how many are different
df['categorical variable'].value_counts().head(10)	Counts the number of occurrences of each categorial variable and shows the first 10
df.columns	Prints the names of the columns
df.columns = ['col1','col2','col3']	It names the columns according to list
df.rename(columns={'Old1' : 'New1', 'Old2' : 'New2'}, inplace=True)	Renames some of the columns
df.drop_duplicates(inplace=True)	Eliminates repeated rows
df.isnull().sum()	Returns the sum of missing values in each variable
df.dropna()	Will eliminate rows having at least one null value
df.dropna(axis=1)	Will eliminate columns having at least one null value
df.mean()	Computes the arithmetic mean of the dataframe
df.fillna(x_mean, inplace=True)	Will replace missing values with given mean value
df.corr()	Show the correlation between variables

Selecting information from a Pandas DataFrame

Function	Action
df['B']	Selects column 'B'
df[ ['A','B'] ]	Selects columns 'A' and 'B'
df_new = df[ ['A','B'] ]	Creates a new dataframe df_new composed by two selected columns of df
df['C'] = df['A'] + df['B']	Creates a new column in df, being the sum of columns 'A' and 'B'
df_copy = df.copy()	Creates a new dataframe copy of existing df
df.drop('D', axis=1, inplace=True)	Eliminates column 'D' and redefines df
df.loc['2']	Returns df row with index '2'
df.loc['2','C']	Returns the specific value of df, with index=2, and column='C'
df.iloc[2]	Returns row with index=2
df.loc['2':'4']	Returns rows with index '2','3', and '4'
df.iloc[2:4]	Returns rows with index '2' and '3', excludes '4'
df[df['B'] == 5.0]	Selects rows where the condition df['B'] equals 5.0
df[(df['B'] == 5.0) & (df['D'] <= 2.0)]	Selects rows that satisfy both conditions simultaneously

The standard operators for comparing two values and conditional clauses.

Operators
Comparison	'<', '<=', '==', '!=', '>=', '>',
or their wrappers	'.lt()', '.le()', 'eq()', '.ne()', '.ge()', '.gt()'
Conditionals	& (and), \| (or)

Jupyter Notebook Examples

Reading data files in Jupyter Notebooks
Introduction to Pandas
More examples
Examples from Python Data Science Handbook. Jake VanderPlas.

Available resources

Datasets

We list a small collection of repositories having available datasets for practicing data analysis.

Pandas Cheat Sheets

General References.

Pandas Tutorials
Pandas Cookbook
Pandas User Guide
Python for Data Analysis, 3rd. Edition. Wes McKinney (O'Reilly Media, 2021).
Python Data Science Handbook. Jake VanderPlas (O'Reilly Media, 2016).

Created: 01/30/2022 (C. Lizarraga); Last update: 10/09/2023 (B. Huppenthal)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Wrangling 101: Pandas in Action