Skip to content

Commit

Permalink
Created the scipy-stats concept file (#5877)
Browse files Browse the repository at this point in the history
* Created the scipy stats file

* syntax update

* Update scipy-stats.md

* Formating fixes

* Update scipy-stats.md

minor fixes

---------
  • Loading branch information
dakshdeepHERE authored Dec 22, 2024
1 parent 186d5cc commit 88e1876
Showing 1 changed file with 79 additions and 0 deletions.
79 changes: 79 additions & 0 deletions content/scipy/concepts/scipy-stats/scipy-stats.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
---
Title: 'scipy.stats'
Description: 'scipy.stats is a Python module offering statistical functions, distributions, and hypothesis tests for data analysis.'
Subjects:
- 'Data Science'
- 'Machine Learning'
Tags:
- 'Distributions'
- 'Hypothesis Testing'
- 'Python'
- 'Statistics'
CatalogContent:
- 'learn-python'
- 'paths/data-science'
---

The **`scipy.stats`** module is part of the broader [SciPy](https://www.codecademy.com/resources/docs/scipy) library for scientific computing in Python. It provides functionality for working with various probability distributions, conducting hypothesis tests, and computing descriptive statistics. By leveraging `scipy.stats`, data scientists and analysts can quickly explore their data, model it using theoretical distributions, and draw meaningful conclusions through statistical inference.

## Probability Distributions

`scipy.stats` provides a wide range of distributions (e.g., Normal, Exponential, Binomial) with methods to work with them. For example, for the Normal distribution:

```pseudo
stats.norm.pdf(x) # Probability Density Function
stats.norm.cdf(x) # Cumulative Distribution Function
stats.norm.rvs(size=n) # Generate random samples
```

- `pdf`: Returns the probability density function (PDF) value at a given point for continuous distributions..
- `cdf`: Gives the probability that a random variable is less than or equal to a certain value.
- `rvs`: Draws random samples from the specified distribution.

These methods can be used with other distributions available in `scipy.stats` by replacing norm with the desired distribution (e.g., `expon`, `binom`).

## Descriptive Statistics

Compute common statistical measures with both `numpy` and `scipy.stats`:

```pseudo
np.mean(data)
np.median(data)
stats.mode(data)
stats.describe(data)
```

- `mean()`: Computes the average value of the data.
- `median()`: Finds the middle value separating the higher and lower halves of the data.
- `mode()`: Returns the most frequently occurring value (for multi-modal data, it returns the smallest mode).
- `describe()`: Provides a quick summary of the data, including count, min, max, mean, variance, skewness, and kurtosis.

> **Note**: While `mean` and `median` are part of `numpy`, `mode` and `describe` belong to `scipy.stats`.
## Hypothesis Testing

Perform a variety of statistical tests to assess differences or relationships:

```pseudo
stats.ttest_ind(group1, group2) # Independent t-test
stats.chisquare(observed, expected) # Chi-square test
stats.mannwhitneyu(group1, group2) # Mann-Whitney U test
```

- `ttest_ind()`: Checks if the means of two independent samples differ significantly.
- `chisquare()`: Compares observed frequencies to expected frequencies for a goodness-of-fit test.
- `mannwhitneyu()`: Tests for differences in the distribution of two independent samples (non-parametric).

## Correlation and Regression

Evaluate relationships between variables:

```pseudo
stats.pearsonr(x, y) # Pearson correlation
stats.spearmanr(x, y) # Spearman rank correlation
stats.kendalltau(x, y) # Kendall’s Tau correlation
```

- `pearsonr()`: Measures linear correlation between two datasets.
- `spearmanr()`: Measures rank-based correlation, less sensitive to non-linear relationships.
- `kendalltau()`: Measures the association between two measured quantities using rank correlation.

0 comments on commit 88e1876

Please sign in to comment.