-
Notifications
You must be signed in to change notification settings - Fork 2
Data Wrangling 101: Pandas in Action
📓 These notes: https://github.com/ua-datalab/Workshops/wiki/Data-Wrangling-101%3A-Pandas-in-Action
💻 Code: https://github.com/ua-datalab/Workshops/blob/main/Introduction-to-Pandas/pandas-intro.ipynb
📹 Link to Zoom recording (10-10-2023)
(Image Credit: Wikimedia Commons, CC)
Pandas is a Python Library designed by Wes McKinney for data manipulation and data analysis. Pandas basic data structures objects are 1-dimensional Series and 2-dimensional DataFrame.
Pandas can read a wide variety of data formats, such as comma separated values (csv), Microsoft Excel files, JSON, SQL tables and queries, and more. Pandas is a base tool for data cleansing and data wrangling.
Learning Objetives
- Introduction to data structures
- How to create a DataFrame data structure
- Reading input data files into a DataFrame
- DataFrame indexing
- Working with text and categorial variables
- Basic operations on a DataFrame
- How to select rows or columns from a DataFrame
- How to add or delete rows or columns in a DataFrame
- How to merge, join, concatenate and compare DataFrames
Please see Slides: tinyurl.com/UAD7-PythonDataAnalysis
The Pandas Python Library was developed in 2008 by Wes McKinney, for performing data manipulation and analysis. Pandas uses data structures and functions for manipulating numerical tables and time series.
What is a DataFrame?. It is a two-dimensional, size-mutable, potentially heterogeneous tabular data.
We will again use Google Colab where the Pandas is included in the available software libraries like NumPy and Matplotlib used in the previous two weeks.
Function | Action |
---|---|
pd.read_csv(filename) | Reads a CSV (comma separated values) file |
pd.read_excel(filename) | Read an Excel file |
Reading other file formats | |
pd.to_csv(filename) | Write the dataframe to a file |
df.head() | Shows the first 5 rows by default. If you wish to print n rows use df.head(n) |
df.tail() | Shows the last 5 rows by default |
df.shape | Prints the dataframe dimensions (rows, columns) |
df.info() | Prints out dataframe information: number of rows, columns, names, data types, number of non null entries and more |
df.describe() | Returns a statistical analysis of float variables |
df['categorical variable'].describe() | Describes how many they are and how many are different |
df['categorical variable'].value_counts().head(10) | Counts the number of occurrences of each categorial variable and shows the first 10 |
df.columns | Prints the names of the columns |
df.columns = ['col1','col2','col3'] | It names the columns according to list |
df.rename(columns={'Old1' : 'New1', 'Old2' : 'New2'}, inplace=True) | Renames some of the columns |
df.drop_duplicates(inplace=True) | Eliminates repeated rows |
df.isnull().sum() | Returns the sum of missing values in each variable |
df.dropna() | Will eliminate rows having at least one null value |
df.dropna(axis=1) | Will eliminate columns having at least one null value |
df.mean() | Computes the arithmetic mean of the dataframe |
df.fillna(x_mean, inplace=True) | Will replace missing values with given mean value |
df.corr() | Show the correlation between variables |
Function | Action |
---|---|
df['B'] | Selects column 'B' |
df[ ['A','B'] ] | Selects columns 'A' and 'B' |
df_new = df[ ['A','B'] ] | Creates a new dataframe df_new composed by two selected columns of df |
df['C'] = df['A'] + df['B'] | Creates a new column in df, being the sum of columns 'A' and 'B' |
df_copy = df.copy() | Creates a new dataframe copy of existing df |
df.drop('D', axis=1, inplace=True) | Eliminates column 'D' and redefines df |
df.loc['2'] | Returns df row with index '2' |
df.loc['2','C'] | Returns the specific value of df, with index=2, and column='C' |
df.iloc[2] | Returns row with index=2 |
df.loc['2':'4'] | Returns rows with index '2','3', and '4' |
df.iloc[2:4] | Returns rows with index '2' and '3', excludes '4' |
df[df['B'] == 5.0] | Selects rows where the condition df['B'] equals 5.0 |
df[(df['B'] == 5.0) & (df['D'] <= 2.0)] | Selects rows that satisfy both conditions simultaneously |
The standard operators for comparing two values and conditional clauses.
Operators | |
---|---|
Comparison | '<', '<=', '==', '!=', '>=', '>', |
or their wrappers | '.lt()', '.le()', 'eq()', '.ne()', '.ge()', '.gt()' |
Conditionals | & (and), | (or) |
- Reading data files in Jupyter Notebooks
- Introduction to Pandas
- More examples
- Examples from Python Data Science Handbook. Jake VanderPlas.
We list a small collection of repositories having available datasets for practicing data analysis.
- Google Dataset Search
- Kaggle
- Papers with Code
- University of California at Irvine
- US Census Bureau
- US Data.Gov
- Pandas Tutorials
- Pandas Cookbook
- Pandas User Guide
- Python for Data Analysis, 3rd. Edition. Wes McKinney (O'Reilly Media, 2021).
- Python Data Science Handbook. Jake VanderPlas (O'Reilly Media, 2016).
Created: 01/30/2022 (C. Lizarraga); Last update: 10/03/2023 (B. Huppenthal)
UArizona DataLab, Data Science Institute, University of Arizona, 2024.