Skip to content

Data Wrangling 101: Pandas in Action

Carlos Lizarraga-Celaya edited this page Aug 16, 2024 · 10 revisions

Data Wrangling 101: Pandas in Action


📓 Notes:

https://github.com/ua-datalab/Workshops/wiki/Data-Wrangling-101:-Pandas-in-Action

💻 Code: https://github.com/ua-datalab/Workshops/blob/main/Data_Wrangling_101_Pandas_in_Action/Data_Wrangling_101_Pandas_in_Action.ipynb

📹 Link to Zoom recording (10-10-2023)

Pandas Library

(Image Credit: Wikimedia Commons, CC)


Pandas is a Python Library designed by Wes McKinney for data manipulation and data analysis. Pandas basic data structures objects are 1-dimensional Series and 2-dimensional DataFrame.

Pandas can read a wide variety of data formats, such as comma separated values (csv), Microsoft Excel files, JSON, SQL tables and queries, and more. Pandas is a base tool for data cleansing and data wrangling.


Learning Objectives

  • Introduction to Pandas data structures
  • How to create a DataFrame data structure
  • Reading input data files into a DataFrame
  • DataFrame indexing
  • Working with text and categorial variables
  • Basic operations on a DataFrame
  • How to select rows or columns from a DataFrame
  • How to add or delete rows or columns in a DataFrame
  • How to merge, join, concatenate and compare DataFrames

Please see Slides: tinyurl.com/UAD7-PythonDataAnalysis


The Pandas Python Library was developed in 2008 by Wes McKinney, for performing data manipulation and analysis. Pandas uses data structures and functions for manipulating numerical tables and time series.

What is a DataFrame? It is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure.

dataframe

We will again use Google Colab where the Pandas is included in the available software libraries like NumPy and Matplotlib used in the previous week.

Some basic operations on a Pandas DataFrame df.

Function Action
pd.read_csv(filename) Reads a CSV (comma separated values) file
pd.read_excel(filename) Read an Excel file
Reading other file formats
pd.to_csv(filename) Write the dataframe to a file
df.head() Shows the first 5 rows by default. If you wish to print n rows use df.head(n)
df.tail() Shows the last 5 rows by default
df.shape Prints the dataframe dimensions (rows, columns)
df.info() Prints out dataframe information: number of rows, columns, names, data types, number of non null entries and more
df.describe() Returns a statistical analysis of float variables
df['categorical variable'].describe() Describes how many they are and how many are different
df['categorical variable'].value_counts().head(10) Counts the number of occurrences of each categorial variable and shows the first 10
df.columns Prints the names of the columns
df.columns = ['col1','col2','col3'] It names the columns according to list
df.rename(columns={'Old1' : 'New1', 'Old2' : 'New2'}, inplace=True) Renames some of the columns
df.drop_duplicates(inplace=True) Eliminates repeated rows
df.isnull().sum() Returns the sum of missing values in each variable
df.dropna() Will eliminate rows having at least one null value
df.dropna(axis=1) Will eliminate columns having at least one null value
df.mean() Computes the arithmetic mean of the dataframe
df.fillna(x_mean, inplace=True) Will replace missing values with given mean value
df.corr() Show the correlation between variables

Selecting information from a Pandas DataFrame

Function Action
df['B'] Selects column 'B'
df[ ['A','B'] ] Selects columns 'A' and 'B'
df_new = df[ ['A','B'] ] Creates a new dataframe df_new composed by two selected columns of df
df['C'] = df['A'] + df['B'] Creates a new column in df, being the sum of columns 'A' and 'B'
df_copy = df.copy() Creates a new dataframe copy of existing df
df.drop('D', axis=1, inplace=True) Eliminates column 'D' and redefines df
df.loc['2'] Returns df row with index '2'
df.loc['2','C'] Returns the specific value of df, with index=2, and column='C'
df.iloc[2] Returns row with index=2
df.loc['2':'4'] Returns rows with index '2','3', and '4'
df.iloc[2:4] Returns rows with index '2' and '3', excludes '4'
df[df['B'] == 5.0] Selects rows where the condition df['B'] equals 5.0
df[(df['B'] == 5.0) & (df['D'] <= 2.0)] Selects rows that satisfy both conditions simultaneously

The standard operators for comparing two values and conditional clauses.

Operators
Comparison '<', '<=', '==', '!=', '>=', '>',
or their wrappers '.lt()', '.le()', 'eq()', '.ne()', '.ge()', '.gt()'
Conditionals & (and), | (or)

Jupyter Notebook Examples


Available resources

Datasets

We list a small collection of repositories having available datasets for practicing data analysis.

Pandas Cheat Sheets

General References.


Created: 01/30/2022 (C. Lizarraga); Last update: 10/09/2023 (B. Huppenthal)

CC BY-NC-SA

UArizona DataLab, Data Science Institute, University of Arizona, 2024.