# 7. Data Cleaning and Preparation
## Learning Objectives
- Know which tools to use for missing data
- Know how to filter out missing data
- Understand methods to fill in missing values
- Know when and how to transform data
- Know how to use certain `numpy` functions to handle outliers, permute, and take random samples
- Know how to manipulate strings
- Understand some useful methods for regular expressions
- Learn about some helpful methods in `pandas` to explore strings
- Understand how to handle categorical data more efficiently
------------------------------------------------------------------------
```{python}
#| warning: false
import pandas as pd
import numpy as np
food = pd.read_csv("https://openmv.net/file/food-consumption.csv")
print(food.head(5))
```
*dataset: The relative consumption of certain food items in European and Scandinavian countries. The numbers represent the percentage of the population consuming that food type*
## 7.1 Handling Missing Data
Some things to note:
- ALL DESCRIPTIVE STATISTICS ON `pandas` OBJECTS EXCLUDE MISSING DATA - BY DEFAULT
- `NaN` is used for missing values of type `float64`
- Values like `NaN` are called *sentinel values*
- a value that is not part of the input but carries a special meaning; a signal value
- for example, `NaN` for a missing float, or `-1` returned from a function that otherwise computes only non-negative integers
```{python}
print(food.Yoghurt.isna())
```
We do have an `NaN` in our midst!
```{python}
# descriptive stats
print(np.mean(food['Yoghurt']), "\n versus", np.average(food['Yoghurt']))
```
Different results! Why?? Look at the two implementations below. `np.mean` checks whether its argument defines its own `mean` method and, if so, delegates to it; for a `pandas` Series that means calling `Series.mean`, which skips `NaN` by default. `np.average` computes a *weighted* average directly over the underlying values (over the flattened array by default), so the `NaN` propagates through the sum.
Likewise, `sum(food.Yoghurt)` returns `nan`.
from `average` source:

```python
avg = avg_as_array = np.multiply(a, wgt,
                                 dtype=result_dtype).sum(axis, **keepdims_kw) / scl
```

from `mean` source:

```python
if type(a) is not mu.ndarray:
    try:
        mean = a.mean
    except AttributeError:
        pass
    else:
        return mean(axis=axis, dtype=dtype, out=out, **kwargs)
return _methods._mean(a, axis=axis, dtype=dtype,
                      out=out, **kwargs)
```
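You can reproduce both results from `Series.mean` alone, which is what `np.mean` ends up calling (a quick check, using the `food` frame loaded above):
```{python}
# Series.mean skips NaN by default -- this is what np.mean delegates to
print(food['Yoghurt'].mean())
# disabling skipna reproduces np.average's nan
print(food['Yoghurt'].mean(skipna=False))
```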
FYI: the `statistics` [module](https://docs.python.org/3/library/statistics.html?highlight=mean#statistics.mean) includes `mean()`
Something weird to consider....
```{python}
print(np.nan == np.nan)
# per the IEEE 754 floating-point standard, NaN is not equal to itself!
```
I digress...
### Filtering Missing Data
```{python}
# the four main tools: dropna, fillna, isna, notna
print("`dropna`: pass `how='all'` to drop only rows where every value is NaN\n",
      food.Yoghurt.dropna().tail(), "\n",
      "`fillna`: pass a dictionary (fillna({1: 0.5, 2: 0})) to give each column its own fill value\n",
      food.Yoghurt.fillna(0).tail(), "\n",
      "`isna`\n", food.Yoghurt.isna().tail(), "\n",
      "`notna`\n", food.Yoghurt.notna().tail())
```
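As the note above says, `fillna` accepts a dictionary keyed by column so that different columns get different fill values; a small sketch filling `Yoghurt` with its own column mean:
```{python}
# fill the Yoghurt NaN with the column mean instead of a constant
food.fillna({"Yoghurt": food["Yoghurt"].mean()}).Yoghurt.tail()
```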
## 7.2 Data Transformation
### Removing Duplicates
Check to see if duplicates exist:
```{python}
food.duplicated()
```
If you do have duplicates, you can use the method `drop_duplicates()`.
*NOTE: by default, `drop_duplicates` keeps only the first occurrence of each duplicated row*
```{python}
dup_food = food[['Yoghurt','Yoghurt']]
dup_food.columns = ['a','b']
dup_food
```
```{python}
# rows 11 and 12 are dropped: their (a, b) values already appeared in earlier rows
dup_food.drop_duplicates()
```
```{python}
# keep='last' retains the last occurrence instead, so the earlier copies (rows 6 and 10) are dropped
dup_food.drop_duplicates(keep = 'last')
```
```{python}
# subset=['a'] checks only column a for duplicates; since b mirrors a here, the result matches the first call
dup_food.drop_duplicates(subset=['a'])
```
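What's happening: a row counts as a duplicate when all compared columns match an *earlier* row, and since `a` and `b` are identical copies of `Yoghurt`, any repeated yogurt percentage marks its row as a duplicate. A minimal sketch makes the `keep` rules obvious:
```{python}
# row 1 duplicates row 0, so exactly one of the pair survives
demo = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
print(demo.drop_duplicates())             # keeps the first occurrence (row 0)
print(demo.drop_duplicates(keep="last"))  # keeps the last occurrence (row 1)
```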
### Transforming Data with a Function or Mapping
Since mapping a function over a series has already been covered, this section will only go over a few more helpful ways to map.
- define your own function, similar to how we would with `apply` functions or `purrr::map()` in R
```{python}
food_sub = food[:5][['Country','Yoghurt']]
country_yogurt = {
'Germany':'Quark',
'Italy':'Yomo',
'France':'Danone',
'Holland':'Campina',
'Belgium':'Activia'
}
```
```{python}
def get_yogurt(x):
return country_yogurt[x]
food_sub['Brand'] = food_sub['Country'].map(get_yogurt)
food_sub['Country'].map(get_yogurt)
```
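`Series.map` also accepts a dictionary directly, so the wrapper function is optional:
```{python}
# passing the mapping dictionary directly gives the same result
food_sub['Country'].map(country_yogurt)
```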
### Replace Values
```{python}
print("using `replace`: \n", food_sub.replace([30],50), '\n',
"using `replace` for more than one value: \n", food_sub.replace([30, 20],[50, 40]))
```
### Renaming Axis Indices
As we've seen, standard indices are labelled as such:

```python
>>> food_sub.index
RangeIndex(start=0, stop=5, step=1)
```
That can also be changed with the mapping of a function:
```{python}
print(food_sub.index.map(lambda x: x + 10))
print('or')
print(food_sub.index.map({0:'G', 1:'I', 2:'F', 3:'H', 4:'B'}))
```
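The `rename` method does the same job and returns a new object, leaving `food_sub` untouched; a small sketch (the new labels here are arbitrary):
```{python}
# rename maps index labels and/or column names via dictionaries
food_sub.rename(index={0: 'G', 1: 'I'}, columns={'Yoghurt': 'yoghurt'})
```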
### Discretization and Binning
It is common to convert continuous variables into discrete bins and group them. Let's group the affinity for yogurt into some arbitrary bins:
```{python}
scale = [0, 20, 30, 50, 70]
# reasonable, ok, interesting, why
pd.cut(food.Yoghurt, scale)
```
```{python}
#|warning: false
scaled = pd.cut(food.Yoghurt.values, scale)
scaled.categories
pd.value_counts(scaled)
```
Apply the labels to the bins to have it make more sense:
```{python}
#|warning: false
scale_names = ['reasonable', 'ok', 'interesting', 'why']
pd.value_counts(pd.cut(food.Yoghurt.values, scale, labels = scale_names))
```
Finally, let `pandas` do the work for you with `qcut`: supply a number of bins and it cuts at sample quantiles, so each bin holds roughly the same number of observations. The `precision` argument limits the number of decimal places in the bin edges.
```{python}
#|warning: false
pd.qcut(food.Yoghurt.values, 4, precision = 2)
```
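Counting the quantile bins confirms the roughly equal group sizes:
```{python}
#|warning: false
# each quantile bin holds about a quarter of the observations
pd.value_counts(pd.qcut(food.Yoghurt.values, 4, precision=2))
```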
### Detecting and Filtering Outliers
We often have to face the decision of how to handle outliers. We can choose to exclude them or to transform them.
```{python}
# let's say any country whose percentage of yogurt consumption is over 50% is an outlier
yog = food.Yoghurt
yog[yog.abs() > 50]
```
More interestingly, what if we wanted to know whether the consumption of ANY food was over 95%?
```{python}
food2 = food.drop('Country', axis = 'columns')
food2[(food2.abs() > 95).any(axis = 'columns')]
```
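To *transform* outliers rather than exclude them, one common option is capping; a hedged sketch that clamps anything above 95 down to 95:
```{python}
# cap instead of drop: boolean assignment clamps the extreme values
capped = food2.copy()
capped[capped > 95] = 95
capped.max()
```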
### Permutation and Random Sampling
- Permuting = random reordering
- `np.random.permutation` : pass it the length of the axis you want to permute and it returns an array of shuffled integer indices
- Random sampling = each sample has an equal probability of being chosen
Let's randomly reorder the first five rows of the food data:
```{python}
print(np.random.permutation(5))
food.take(np.random.permutation(5))
```
This method can be helpful when using `iloc` indexing!
```{python}
food.take(np.random.permutation(5), axis = 'columns')
```
Let's try taking a random subset without replacement:
```{python}
food.sample(n=5)
# you can always add `replace=True` if you want sampling with replacement
```
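With `replace=True`, the same row can be drawn more than once:
```{python}
# sampling with replacement may repeat rows
food.sample(n=10, replace=True).head()
```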
### Computing Indicator/Dummy Vars
This kind of transformation is really helpful for machine learning. It converts categorical variables into indicator or *dummy* variables through a transformation that results in 0s and 1s.
```{python}
#|warning: false
pd.get_dummies(food['Country'])
```
This example is not the most helpful since this set of countries is *unique* (every row gets its own indicator column), but I hope you get the idea; a more useful pattern, sketched below, combines `get_dummies` with the binning from earlier. This topic will come up again in Ch. 13 when data analysis examples are worked out.
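A hedged sketch, reusing `scale` and `scale_names` from the binning section above:
```{python}
# one indicator column per bin rather than one per unique country
pd.get_dummies(pd.cut(food.Yoghurt, scale, labels=scale_names)).head()
```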
## 7.3 Extension Data Types
Extension types address some of the shortcomings of `numpy`'s type system, such as:
- expensive string computations
- missing data conversions
- lack of support for time related objects
```{python}
s = pd.Series([1, 2, 3, None])
s.dtype
```
```{python}
s = pd.Series([1, 2, 3, None], dtype=pd.Int64Dtype())
print(s)
print(s.dtype)
```
Note that this extension type indicates missing values with `<NA>`:
```{python}
print(s.isna())
```
`<NA>` uses the `pandas.NA` sentinel value:
```{python}
s[3] is pd.NA
```
Types can be set with `astype()`
```{python}
df = pd.DataFrame({"A": [1, 2, None, 4],
"B": ["one", "two", "three", None],
"C": [False, None, False, True]})
df["A"] = df["A"].astype("Int64")
df["B"] = df["B"].astype("string")
df["C"] = df["C"].astype("boolean")
df
```
Find a table of extension types [here](https://wesmckinney.com/book/data-cleaning.html#pandas-ext-types)
## 7.4 String Manipulation
Functions that are built in:
- `split()` : break a string into pieces
- `join()` : use a string as a delimiter for concatenating a sequence of other strings
- `strip()` : trim whitespace
- the `in` keyword : good for locating a substring
- `count()` : return the number of occurrences of a substring
- `replace()` : substitute occurrences of one pattern for another
See more functions [here](https://wesmckinney.com/book/data-cleaning.html#text_string_methods)
```{python}
lb = " layla is smart, witty, charming, and... "
print(lb.split(" "))
print(lb.strip())
print('-'.join(lb))
print('smart' in lb)
print(lb.count(','))
print(lb.replace('...', ' bad at python.'))
```
### Regular Expressions
RegEx is not easy. It takes some getting used to, but it is really useful for programmatically applying any of the string functions to a particular pattern.
I often refer to this handy [cheat sheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf).
To use regular expression in python, you must import the module `re`:
```{python}
import re
text = "layla has lived in philadelphia county, miami-dade county, and rockdale county"
# split on whitespace
re.split(r"\s+", text)
```
To avoid repeating a common expression, you can *compile* it and store it as its own object:

```python
regex = re.compile(r"\s+")
```

**Don't forget**: characters that carry special meaning in a regex (such as `.`, `+`, `*`, `?`, `(`, `)`, and `\`) must be escaped with a backslash to match them literally.
What if I wanted to get the counties?
```{python}
# lookahead: match word characters immediately followed by " county"
regex = re.compile(r"\w+(?=\s+county)")
# note: \w+ does not match the hyphen, so "miami-dade county" yields only "dade"
regex.findall(text)
```
### String Functions
```{python}
data = {"Dave": "[email protected]", "Steve": "[email protected]",
"Rob": "[email protected]", "Wes": np.nan}
# convert to series
data = pd.Series(data)
data
```
To get certain information, we can apply string functions from `Series` array-oriented methods:
```{python}
# does the string contain something?
print(data.str.contains("gmail"))
# change to the `string` extension type
data_as_string_ext = data.astype('string')
data_as_string_ext
```
```{python}
# vectorized element retrieval
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
data.str.findall(pattern, flags=re.IGNORECASE).str[0]
```
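`str.extract` uses the same pattern but returns the capture groups as columns of a DataFrame:
```{python}
# one column per capture group: username, domain, and suffix
data.str.extract(pattern, flags=re.IGNORECASE)
```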
## 7.5 Categorical Data
```{python}
values = pd.Series(['apple', 'orange', 'apple',
'apple'] * 2)
pd.unique(values)
pd.value_counts(values)
```
You can improve performance by storing the values as integer codes alongside a separate table of the distinct categories (a *dimension table*):
```{python}
values = pd.Series([0, 1, 0, 0] * 2)
dim = pd.Series(['apple', 'orange'])
dim
```
Retrieve the original set of strings with `take`
```{python}
dim.take(values)
```
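pandas packages this codes-plus-categories idea in its `Categorical` extension type, exposed through the `.cat` accessor:
```{python}
# the category dtype stores integer codes and a table of categories
fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2, dtype='category')
print(fruits.cat.codes)
print(fruits.cat.categories)
```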
### Computations with Categoricals
```{python}
rng = np.random.default_rng(seed=12345)
draws = rng.standard_normal(1000)
bins = pd.qcut(draws, 4)
bins
```
```{python}
bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# then use groupby to summarize the draws within each quartile
bins = pd.Series(bins, name='quartile')
results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
results
```
Categorical representations generally lead to better performance and lower memory use than their string equivalents.
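A quick, hedged illustration of the memory savings (the exact numbers vary by platform):
```{python}
# the same labels stored as plain strings vs. as a categorical
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * 250_000)
print(labels.memory_usage(deep=True))
print(labels.astype('category').memory_usage(deep=True))
```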