Pandas

Pandas is a popular Python library for data manipulation and analysis, and many of its functions are particularly useful for Exploratory Data Analysis (EDA). The most common ones are listed below:

  1. Loading Data:

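    • Read data from a file. The example loads a CSV; pd.read_excel() and pd.read_sql() cover Excel files and SQL databases:

      import pandas as pd

      # The leading "!" runs a shell command and works only in Jupyter/IPython notebooks
      !wget https://raw.githubusercontent.com/drshahizan/dataset/main/titanic/train.csv -O train.csv
      df = pd.read_csv('train.csv')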
  2. Data Summary:

    • Get basic information about the dataset:

      df.info()
    • Display summary statistics for numerical columns:

      df.describe()
    • View the first few rows of the dataset:

      df.head()
  3. Data Cleaning and Handling:

    • Handle missing values:

      df.isna().sum()   # Count missing values in each column
      df.dropna()       # Return a copy without rows that contain missing values
      df.fillna(value)  # Return a copy with missing values replaced; value is a placeholder
    • Remove duplicates:

      df = df.drop_duplicates()  # Returns a new DataFrame, so assign the result to keep it
  4. Data Selection and Slicing:

    • Select specific columns:

      df['column_name']
    • Select rows based on conditions:

      df[df['column_name'] > 50]
  5. Data Visualization:

    • Create basic visualizations:

      import matplotlib.pyplot as plt
      df['column_name'].plot(kind='hist')
      plt.show()
    • Pair plots for exploring relationships between multiple numerical variables:

      import seaborn as sns
      sns.pairplot(df)
      plt.show()
  6. Grouping and Aggregation:

    • Group data by a column and calculate statistics:

      # numeric_only=True skips text columns, which raise a TypeError in pandas 2.x
      df.groupby('category_column').mean(numeric_only=True)
  7. Correlation Analysis:

    • Compute the correlation matrix:

      df.corr(numeric_only=True)  # Restrict to numeric columns; needed in pandas 2.x when text columns are present
  8. Outlier Detection:

    • Identify outliers using z-scores:

      import numpy as np
      from scipy import stats
      # Absolute z-score of each value; large values are candidate outliers
      z_scores = np.abs(stats.zscore(df['column_name']))
  9. Data Transformation:

    • Apply functions to columns:

      df['column_name'] = df['column_name'].apply(function)
    • Apply transformations (e.g., log transformation):

      import numpy as np
      df['column_name'] = np.log(df['column_name'])  # Requires strictly positive values
  10. Categorical Variables:

    • Get frequency counts of unique values:

      df['category_column'].value_counts()
  11. Data Export:

    • Save the modified DataFrame to a new file:

      df.to_csv('new_data.csv', index=False)
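
The cleaning, selection, grouping, and correlation steps above can be chained into one short workflow. The sketch below is a minimal illustration, not part of the original notes: the toy DataFrame is hypothetical (it only borrows Titanic-style column names), and filling `Age` with the median is an assumed choice rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration only (not the real Titanic file).
df = pd.DataFrame({
    "Pclass": [1, 3, 3, 1, 2, 3, 3],
    "Sex": ["female", "male", "male", "female", "male", "female", "female"],
    "Age": [38.0, 22.0, np.nan, 35.0, 54.0, 4.0, 4.0],
    "Fare": [71.28, 7.25, 8.05, 53.10, 13.00, 16.70, 16.70],
})

# Data cleaning: count missing values, drop exact duplicate rows,
# then fill the remaining missing ages with the median age.
missing = df.isna().sum()
clean = df.drop_duplicates()
clean = clean.fillna({"Age": clean["Age"].median()})

# Selection: keep only the rows where the fare exceeds 50.
expensive = clean[clean["Fare"] > 50]

# Grouping and correlation: numeric_only=True leaves out the text column "Sex".
mean_by_class = clean.groupby("Pclass").mean(numeric_only=True)
corr = clean.corr(numeric_only=True)

print(missing)
print(mean_by_class)
print(corr)
```

Each step assigns its result to a new name because `drop_duplicates()` and `fillna()` return new DataFrames rather than modifying `df` in place.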

These are some of the most common pandas functions for EDA. Depending on your dataset and analysis goals, you may need additional pandas functions and techniques to explore and analyze your data effectively.
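
As one example of combining these techniques, the outlier detection and log transformation steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the fare values are hypothetical, the z-score is computed directly in pandas with `ddof=0` (which matches the default of `scipy.stats.zscore`), and the 2.5 cutoff is an assumed threshold chosen because z-scores stay small in a sample of only ten values.

```python
import numpy as np
import pandas as pd

# Hypothetical fare values with one extreme entry, for illustration only.
fares = pd.Series(
    [7.25, 8.05, 7.90, 13.00, 8.46, 10.50, 7.75, 9.00, 8.00, 512.33],
    name="Fare",
)

# Z-score: distance from the mean in units of the population standard deviation.
z_scores = (fares - fares.mean()) / fares.std(ddof=0)

# Flag values whose absolute z-score exceeds the assumed cutoff of 2.5.
outliers = fares[z_scores.abs() > 2.5]

# log1p computes log(1 + x), which compresses large values and is defined at zero.
log_fares = np.log1p(fares)

print(outliers)
print(log_fares.round(2))
```

Using `np.log1p` instead of `np.log` is a design choice for columns that may contain zeros, since `np.log(0)` evaluates to negative infinity.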

Contribution 🛠️

Please create an Issue for any improvements, suggestions or errors in the content.

You can also contact me on LinkedIn with any other queries or feedback.
