fix: Minor fixes for visualization tutorial, summary & heatmap #391

Merged
merged 7 commits into from
Jan 31, 2023
4 changes: 2 additions & 2 deletions Runtime/safe-ds/safeds/data/tabular/_table.py
@@ -835,14 +835,14 @@ def summary(self) -> Table:

for function in statistics.values():
try:
values.append(function())
values.append(str(function()))
except NonNumericColumnError:
values.append("-")

result = pd.concat([result, pd.DataFrame(values)], axis=1)

result = pd.concat([pd.DataFrame(list(statistics.keys())), result], axis=1)
result.columns = [""] + self.get_column_names()
result.columns = ["metrics"] + self.get_column_names()

return Table(result)

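The effect of wrapping each statistic in `str(...)` can be sketched in isolation. The snippet below is a minimal, standard-library illustration and not the library's implementation: `NonNumericColumnError` is redefined locally as a stand-in, and the helpers `maximum`, `mean`, `row_count`, and `summarize` are hypothetical names chosen for this example. It shows that every cell of a summary column ends up as a string, whether the statistic could be computed or was replaced by `"-"`.

```python
class NonNumericColumnError(Exception):
    """Local stand-in for the library's exception of the same name (assumption)."""


def _require_numeric(values):
    # Raise if any value is not a number, mimicking a statistic that
    # is only defined for numeric columns.
    if not all(isinstance(v, (int, float)) for v in values):
        raise NonNumericColumnError("column contains non-numeric values")


def maximum(values):
    _require_numeric(values)
    return max(values)


def mean(values):
    _require_numeric(values)
    return sum(values) / len(values)


def row_count(values):
    # Defined for every column, numeric or not.
    return len(values)


def summarize(values):
    """Return one string per statistic, using "-" where it cannot be computed."""
    statistics = {
        "maximum": lambda: maximum(values),
        "mean": lambda: mean(values),
        "row count": lambda: row_count(values),
    }
    cells = []
    for function in statistics.values():
        try:
            cells.append(str(function()))  # always store a string
        except NonNumericColumnError:
            cells.append("-")
    return dict(zip(statistics.keys(), cells))


numeric = summarize([1, 2, 1])
textual = summarize(["a", "b", "c"])
print(numeric)
print(textual)
```

Because `str(1)` and `str(1.0)` differ, the expected values in a test need to match the type each statistic returns, which is why the updated test mixes literals such as `"1"` and `"1.0"`.
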
1 change: 1 addition & 0 deletions Runtime/safe-ds/safeds/plotting/_correlation_heatmap.py
@@ -29,6 +29,7 @@ def correlation_heatmap(table: Table) -> None:
vmax=1,
xticklabels=table.get_column_names(),
yticklabels=table.get_column_names(),
cmap="vlag",
)
plt.tight_layout()
plt.show()
36 changes: 24 additions & 12 deletions Runtime/safe-ds/tests/data/tabular/_table/test_summary.py
@@ -22,19 +22,31 @@ def test_summary() -> None:
"row count",
],
"col1": [
2.0,
1.0,
4.0 / 3,
1.0,
1.0,
4.0,
1.0 / 3,
table._data[0].std(),
2.0 / 3,
2.0 / 3,
3.0,
"2",
"1",
str(4.0 / 3),
"1",
"1.0",
"4",
str(1.0 / 3),
str(table._data[0].std()),
str(2.0 / 3),
str(2.0 / 3),
"3",
],
"col2": [
"-",
"-",
"-",
"a",
"-",
"-",
"-",
"-",
"1.0",
str(1.0 / 3),
"3",
],
"col2": ["-", "-", "-", "a", "-", "-", "-", "-", 1.0, 1.0 / 3, 3],
}
)
)
Binary file modified docs/Stdlib/python/Tutorials/Resources/Heatmap.png
Binary file modified docs/Stdlib/python/Tutorials/Resources/Summary.png
48 changes: 31 additions & 17 deletions docs/Stdlib/python/Tutorials/visualization.md
@@ -2,7 +2,8 @@

The following code will use a Jupyter Notebook environment.

First we need some data to visualize. For this we use the common example of the titanic disaster.
## Table & Statistics
First, we need some data to visualize. For this, we use the common example of the titanic disaster.

!!! note
You can download that dataset on [kaggle](https://www.kaggle.com/c/titanic).
@@ -12,33 +13,33 @@ from safeds.data.tabular import Table
data = Table.from_csv("path/to/your/data.csv")
```

Now we want to have look at what our dataset looks like. For this we use Jupyter Notebooks native display function.
Now we want to have a look at what our dataset looks like. For this, we use Jupyter Notebooks native display function.

```python
data # calls display(data)
data    # calls display(data)
```

![Table](./Resources/Table.png)

Next some statistics.

```python
data.summary() # returns a table with various statistics for each column
data.summary()  # returns a table with various statistics for each column
```

![Summary](./Resources/Summary.png)

As you can see here, the **idness** of the column _PassangerId_ is 1. This means, that every row has a unique value for
As you can see here, the **idness** of the column _PassengerId_ is 1. This means, that every row has a unique value for
this column. Since this isn't helpful for our usecase we can drop it.

```python
data_cleaned = data.drop_columns(["PassangerId"])
data_cleaned = data.drop_columns(["PassengerId"])
```
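
To make the idness value more concrete, here is a small sketch that assumes idness is the number of distinct values divided by the number of rows. That definition is an assumption for illustration; the function `idness` and the sample values below are hypothetical and not taken from the library or the dataset.

```python
def idness(values):
    """Fraction of distinct values among all rows (assumed definition)."""
    if not values:
        raise ValueError("idness is undefined for an empty column")
    return len(set(values)) / len(values)


# Made-up sample values, for illustration only.
passenger_ids = [1, 2, 3, 4, 5]  # every row is unique
pclass = [3, 1, 3, 1, 3]         # only two distinct values

print(idness(passenger_ids))
print(idness(pclass))
```

A column whose idness is 1 behaves like a row identifier, so it carries no information that generalizes across rows.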

## Heatmap
Now we have a rough idea of what we are looking at. But we still don't really know a lot about our dataset.
So next we can start to plot a our columns against each other in a so called Heatmap, to understand which values relate to each other.
So next we can start to plot our columns against each other in a so called Heatmap, to understand which values relate to each other.

But since this type of diagramm only works for numerical values, we are going to use only those.
But since this type of diagram only works for numerical values, we are going to use only those.

```python
from safeds.plotting import correlation_heatmap
@@ -49,8 +50,9 @@ correlation_heatmap(data_only_numerics)

![Heatmap](./Resources/Heatmap.png)
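
Each cell of such a heatmap holds a pairwise correlation coefficient. As a rough sketch of what is being plotted, the following standard-library code computes Pearson correlations between columns. It assumes the heatmap shows Pearson coefficients, and the helper names `pearson` and `correlation_matrix` as well as the sample values are made up for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


def correlation_matrix(columns):
    """Map every pair of column names to their correlation coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up sample values, for illustration only.
columns = {
    "Pclass": [1, 1, 2, 3, 3],
    "Fare": [80.0, 70.0, 20.0, 8.0, 7.0],
}
matrix = correlation_matrix(columns)
print(matrix["Pclass"]["Fare"])
```

Correlation coefficients range from -1 to 1, so a diverging color palette makes positive and negative relationships easy to tell apart.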

As you can see, the columns _Fare_ and _Pclass_ (Passanger Class) seem to heavily correlate. Let's have another look at that.
We'll use a linechart to better understand their relationship.
As you can see, the columns _Fare_ and _Pclass_ (Passenger Class) seem to heavily correlate. Let's have another look at that.
## Lineplot
We'll use a lineplot to better understand their relationship.

```python
from safeds.plotting import lineplot
@@ -61,19 +63,31 @@ lineplot(data_cleaned, "Pclass", "Fare")

The line itself represents the central tendency and the hued area around it a confidence interval for that estimate.
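
The idea of a central line with a band around it can be sketched numerically. The code below computes the mean fare per class and an approximate 95% interval using a normal approximation. This is an assumption made for illustration; the plotting library may estimate its interval differently (for example by bootstrapping). The helper `mean_with_interval` and the sample fares are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_interval(values, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval."""
    center = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return center, center - half_width, center + half_width


# Made-up fares per passenger class, for illustration only.
fares_by_class = {
    1: [80.0, 70.0, 90.0, 60.0],
    2: [20.0, 25.0, 15.0, 20.0],
    3: [8.0, 7.0, 9.0, 8.0],
}

bands = {pclass: mean_with_interval(fares) for pclass, fares in fares_by_class.items()}
for pclass, (center, lower, upper) in bands.items():
    print(pclass, center, lower, upper)
```

The first element of each tuple corresponds to the line, and the lower and upper bounds correspond to the edges of the shaded area.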

We can conclude that tickets for first class rooms are much more expensive compared to second and third class.
Also the difference between second and third is less pronounced.
We can conclude that tickets for first classrooms are much more expensive compared to second and third class.
Also, the difference between second and third is less pronounced.

Some other plots that might be useful are boxplots, histogams and scatterplots.
## Other plots
Some other plots that might be useful are boxplots, histograms and scatterplots.

```python
from safeds.plotting import boxplot, histogram, scatterplot
from safeds.plotting import boxplot

boxplot(data_cleaned.get_column("Age"))
histogram(data_cleaned.get_column("Fare"))
scatterplot(data_cleaned, "Age", "Fare")
```

![Boxplot](./Resources/Boxplot.png)

```python
from safeds.plotting import histogram

histogram(data_cleaned.get_column("Fare"))
```
![Histogram](./Resources/Histogram.png)

```python
from safeds.plotting import scatterplot

scatterplot(data_cleaned, "Age", "Fare")
```

![Scatterplot](./Resources/Scatterplot.png)