5/16 Update
AaronWongNSC committed May 16, 2020
1 parent fbf13d4 commit 0502076
Showing 2 changed files with 46 additions and 1 deletion.
@@ -6,6 +6,6 @@ category: code
published: True
---

- I took a deeper dive into the documentation for categories, and I discovered something that is making me pause for a bit. Apparently, the purpose of categories was to create some sort of folder structure in the website. I was approaching it more as an "advanced tab" situation, where the overall structure was still flat but the categories would form a virtual nesting organization. For example, all of the Data Science from Scratch posts could be found by looking at the categories and tracing the pattern of learning to books to Data_Science_From_Scratch. But this feels very different from having the categories lead to nested folders.
+ I took a deeper dive into the documentation for categories, and I discovered something that is making me pause for a bit. Apparently, the purpose of categories was to create some sort of folder structure in the website. I was approaching it more as an "advanced tag" situation, where the overall structure was still flat but the categories would form a virtual nesting organization. For example, all of the Data Science from Scratch posts could be found by looking at the categories and tracing the pattern of learning to books to Data_Science_From_Scratch. But this feels very different from having the categories lead to nested folders.

So at this point, I'm going to step away from playing around with this but will maybe come back to it later.
45 changes: 45 additions & 0 deletions _posts/2020-05-16-Data-Science-From-Scratch-Chapter-10.md
@@ -0,0 +1,45 @@
---
layout: post
title: Data Science from Scratch -- Chapter 10
tags: books Python Data_Science_from_Scratch
category: learning books Data_Science_from_Scratch
published: True
---

This data exploration is a good overview chapter, but since it's doing the "from scratch" approach, I think it leaves a lot of tools on the table for the future. While I've appreciated seeing some of the other pieces from scratch, in this case it seems that it may hurt more than it helps.

## One-Dimensional Data

There's not a lot to see here. He creates a function to help create histograms and shows that two distributions can have the same mean and standard deviation while not having the same distribution.
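As a quick illustration of that point (my own toy numbers, not the book's), two lists can agree on both summary statistics while looking nothing alike:

```python
import statistics

# Two toy datasets: "a" is two tight clumps, "b" is mostly flat with a
# couple of stragglers. Their mean and standard deviation are identical.
a = [9, 9, 9, 11, 11, 11]
b = [8, 10, 10, 10, 11, 11]

print(statistics.mean(a), statistics.pstdev(a))   # 10 1.0
print(statistics.mean(b), statistics.pstdev(b))   # 10 1.0
```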

## Two-Dimensional Data

There's also not much to see here. He creates a scatter plot of two different correlations and then calculates the correlation coefficient using the statistics functions that were created earlier.
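The from-scratch correlation calculation works roughly like this (my sketch, not his exact code, since his version builds on the `mean` and `standard_deviation` helpers from the earlier statistics chapter):

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient, using only the standard library."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))       # unnormalized covariance
    denom = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return cov / denom

# Perfectly linear data gives a coefficient of (almost exactly) 1.
r = correlation([1, 2, 3, 4], [2, 4, 6, 8])
```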

## Higher Dimensional Data

This section gets a bit more interesting. He uses his functions to create a correlation matrix, though he doesn't really talk about it. More interestingly, he also makes a scatterplot matrix from scratch. I'm only loosely familiar with making subplots with matplotlib, and so I mostly have to guess at things and mess around until it works. I'll probably have to work through some examples of this when I get to the Python Data Science Handbook.
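For my own future reference, the bare bones of a scatterplot matrix with `plt.subplots` looks something like this (my sketch with made-up random data, not his code):

```python
import random

import matplotlib
matplotlib.use("Agg")              # render off-screen, no display needed
import matplotlib.pyplot as plt

random.seed(0)
columns = [[random.gauss(0, 1) for _ in range(50)] for _ in range(3)]

n = len(columns)
fig, axes = plt.subplots(n, n, figsize=(6, 6))
for i in range(n):
    for j in range(n):
        if i == j:
            axes[i][j].hist(columns[i], bins=10)           # histograms on the diagonal
        else:
            axes[i][j].scatter(columns[j], columns[i], s=5)
fig.savefig("scatter_matrix.png")
```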

## NamedTuples and Dataclasses

I really wasn't sure why he was bothering with NamedTuples instead of just going to pandas, but Dataclasses are apparently a new Python 3.7 thing (which is funny because the book is set up using Python 3.6). I'm expecting to stick with pandas, so I didn't spend a lot of time playing with this.
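Roughly, the difference between the two looks like this (field names borrowed from the book's stock-price example, with made-up values):

```python
from dataclasses import dataclass
from typing import NamedTuple

class StockPrice(NamedTuple):
    symbol: str
    closing_price: float

@dataclass
class StockPrice2:
    symbol: str
    closing_price: float

nt = StockPrice("AAPL", 170.0)
dc = StockPrice2("AAPL", 170.0)

dc.closing_price = 171.0      # dataclasses are mutable by default
# nt.closing_price = 171.0    # a NamedTuple would raise AttributeError here
```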

## Cleaning and Munging

I've not heard of data munging before, but I have heard it called data wrangling. He writes a manual checking function, but pandas is a thing and has its own methods for doing that. So I didn't spend much time here, either.
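The manual checking boils down to something like this (my sketch of the pattern with made-up rows, not his code):

```python
from typing import Optional, Tuple

def try_parse_row(row: Tuple[str, str]) -> Optional[Tuple[str, float]]:
    """Return (symbol, price) if the price parses as a float, else None."""
    symbol, price_text = row
    try:
        return symbol, float(price_text)
    except ValueError:
        return None

raw = [("AAPL", "170.5"), ("MSFT", "n/a"), ("GOOG", "105.0")]
clean = [parsed for parsed in (try_parse_row(row) for row in raw)
         if parsed is not None]
```

In pandas, something like `pd.to_numeric(col, errors="coerce")` followed by `dropna()` covers the same ground.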

## Manipulating Data

This was actually somewhat interesting. Even though there's nothing fancy here, I've just never approached things using list comprehension in quite this manner. Here's his code for getting the highest ever closing price for Apple stock (after the data was already loaded as a list of NamedTuples).

```python
max_aapl_price = max(stock_price.closing_price
for stock_price in data
if stock_price.symbol == "AAPL")
```

I thought this was a clever and compact way of doing things. I would actually have created the list for Apple stock as a separate object, and then found the maximum of that list.
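For comparison, here's the two-step version I had in mind, with a couple of made-up rows:

```python
from typing import NamedTuple

class StockPrice(NamedTuple):
    symbol: str
    closing_price: float

data = [StockPrice("AAPL", 170.0),
        StockPrice("MSFT", 250.0),
        StockPrice("AAPL", 182.5)]

# Build the AAPL-only list first, then take its max.
aapl_prices = [sp.closing_price for sp in data if sp.symbol == "AAPL"]
max_aapl_price = max(aapl_prices)      # same answer as the one-liner
```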

## Rescaling

He spends a brief section talking about rescaling data. He shows that units can sometimes be deceptive (changing the units can change which points count as nearest neighbors), and then moves on. The rescaling of the data fits in perfectly with calculating the correlation coefficient. I think I've called this process "normalizing" but [upon looking it up](https://www.statisticshowto.com/normalized/) it seems that this is not quite the right word and that I should be calling it standardizing the data.
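Standardizing is just subtracting the mean and dividing by the standard deviation; a minimal version:

```python
import statistics

def standardize(xs):
    """Rescale xs to have mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(xs)
    stdev = statistics.pstdev(xs)
    return [(x - mean) / stdev for x in xs]

# The result is symmetric around 0 with unit spread (up to float error).
zs = standardize([10.0, 20.0, 30.0])
```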
