
Exercise 4: EDA - Descriptive Statistics

This exercise walks through calculating summary statistics for numerical columns using pandas methods such as describe(), mean(), median(), and std(). It also shows how to create frequency tables for categorical columns.

Step 1: Calculate Summary Statistics for Numerical Columns

  1. Load the Titanic Dataset:

    • Ensure the dataset is loaded into a pandas DataFrame.
    import pandas as pd
    # Notebook-only shell command (Jupyter/Colab); in a terminal, run it without the leading "!".
    !wget https://raw.githubusercontent.com/drshahizan/dataset/main/titanic/train.csv -O train.csv
    df = pd.read_csv('train.csv')
  2. Use the describe() Method:

    • The describe() method provides a summary of statistics for numerical columns.
    df.describe()
  3. Calculate Mean, Median, and Standard Deviation:

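    • Use the mean(), median(), and std() methods to calculate these statistics for numerical columns. Pass numeric_only=True so that text columns such as Name are skipped; in pandas 2.0 and later, these methods raise a TypeError on a DataFrame that contains text columns unless this argument is set.
    mean_values = df.mean(numeric_only=True)
    median_values = df.median(numeric_only=True)
    std_values = df.std(numeric_only=True)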
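
To make the relationship between describe() and the individual methods concrete, here is a minimal sketch on a small hand-made DataFrame. The frame `toy` and its values are hypothetical and are not taken from the Titanic file; only the pandas calls themselves come from the steps above.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not rows from train.csv).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# numeric_only=True restricts the calculation to Age and Fare.
mean_values = toy.mean(numeric_only=True)
median_values = toy.median(numeric_only=True)
std_values = toy.std(numeric_only=True)  # sample standard deviation (ddof=1)

# describe() reports the same figures in its "mean", "50%" and "std" rows.
summary = toy.describe()
print(summary.loc[["mean", "50%", "std"]])
```

The "50%" row of describe() is the median, so the two approaches should agree for every numerical column.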

Step 2: Create Frequency Tables for Categorical Columns

  1. Identify Categorical Columns:

    • Use the select_dtypes() method to select columns with object data type (typically used for categorical data).
    categorical_columns = df.select_dtypes(include=['object']).columns
  2. Create Frequency Tables:

    • Use the value_counts() method to create frequency tables for each categorical column.
    frequency_tables = {col: df[col].value_counts() for col in categorical_columns}
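
The behaviour of value_counts() is easier to see on a tiny example. The sketch below uses a hypothetical frame `toy` with made-up values (the column names merely mirror the Titanic ones) and shows two optional arguments, dropna and normalize, that the steps above do not use but that are often helpful when reading a frequency table.

```python
import pandas as pd

# Hypothetical toy data; one Embarked value is deliberately missing.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Embarked": ["S", "C", None, "S", "Q"],
})

# By default, missing values are left out of the table.
counts = toy["Embarked"].value_counts()

# dropna=False adds a row that counts the missing values.
counts_with_na = toy["Embarked"].value_counts(dropna=False)

# normalize=True returns proportions instead of raw counts.
shares = toy["Sex"].value_counts(normalize=True)

print(counts)
print(counts_with_na)
print(shares)
```

Comparing the totals of `counts` and `counts_with_na` is a quick way to spot how many values a categorical column is missing.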

Step-by-Step Execution

  1. Load the Titanic Dataset:

    import pandas as pd
    df = pd.read_csv('train.csv')
  2. Use the describe() Method:

    summary_statistics = df.describe()
    print(summary_statistics)
  3. Calculate Mean, Median, and Standard Deviation:

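    # numeric_only=True skips text columns, which would otherwise raise a TypeError in pandas 2.0+
    mean_values = df.mean(numeric_only=True)
    median_values = df.median(numeric_only=True)
    std_values = df.std(numeric_only=True)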
    
    print("Mean Values:\n", mean_values)
    print("Median Values:\n", median_values)
    print("Standard Deviation Values:\n", std_values)
  4. Identify Categorical Columns:

    categorical_columns = df.select_dtypes(include=['object']).columns
    print("Categorical Columns:\n", categorical_columns)
  5. Create Frequency Tables:

    frequency_tables = {col: df[col].value_counts() for col in categorical_columns}
    for col, freq_table in frequency_tables.items():
        print(f"Frequency Table for {col}:\n{freq_table}\n")

By following these steps, you will have calculated summary statistics for numerical columns and created frequency tables for categorical columns in the Titanic dataset.
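
The same steps can also be collected into one reusable helper. This is only a sketch under stated assumptions: the function name `summarize` is hypothetical, and it treats every non-numeric column as categorical, which is broader than the object-only selection used in the exercise.

```python
import pandas as pd

def summarize(df: pd.DataFrame):
    """Return (numeric summary table, {column: frequency table}) for df.

    Hypothetical helper, not part of pandas.
    """
    # Numerical columns: one row per column with mean, median and std.
    numeric = df.select_dtypes(include="number")
    stats = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
    })
    # Assumption: every non-numeric column is treated as categorical.
    categorical = df.select_dtypes(exclude="number").columns
    freq = {col: df[col].value_counts() for col in categorical}
    return stats, freq

# Usage on hypothetical toy data (not rows from train.csv).
toy = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0],
    "Sex": ["male", "female", "male"],
})
stats, freq = summarize(toy)
print(stats)
print(freq["Sex"])
```

Calling `summarize(df)` on the loaded Titanic DataFrame would return both results at once, although high-cardinality text columns would then produce very long frequency tables.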

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
