Commit

fix: Minor fixes for visualization tutorial, summary & heatmap (#391)
### Summary of Changes

Fixed the visualization tutorial with the changes discussed in #380, as
well as three of the bullet points from #381 (namely the summary header,
all summary values as strings, and the heatmap color scheme).

---------

Co-authored-by: SmiteDeluxe <[email protected]>
SmiteDeluxe and SmiteDeluxe authored Jan 31, 2023
1 parent 483a974 commit 4926613
Showing 6 changed files with 58 additions and 31 deletions.
4 changes: 2 additions & 2 deletions Runtime/safe-ds/safeds/data/tabular/_table.py
Original file line number Diff line number Diff line change
@@ -835,14 +835,14 @@ def summary(self) -> Table:

for function in statistics.values():
try:
- values.append(function())
+ values.append(str(function()))
except NonNumericColumnError:
values.append("-")

result = pd.concat([result, pd.DataFrame(values)], axis=1)

result = pd.concat([pd.DataFrame(list(statistics.keys())), result], axis=1)
- result.columns = [""] + self.get_column_names()
+ result.columns = ["metrics"] + self.get_column_names()

Gerhardsa0 (Contributor) commented on Jan 31, 2023:

You forgot to change the name in the test function from "" to "metrics", so the test in test_summary failed. It took me a while to see this because I thought the test failed because of my changes.

This is fixed in PR #396.


return Table(result)
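
The effect of the two changed lines can be seen in a minimal pandas sketch. This is not the Safe-DS implementation: the `summarize` helper, the three statistics, and the sample data are assumptions made for illustration. It shows that wrapping each result in `str(...)` gives every cell the same type, with `"-"` marking statistics that do not apply, and that the first column carries the `metrics` header.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Build a small summary table whose cells are all strings (hypothetical helper)."""
    statistics = {
        "maximum": lambda s: s.max(),
        "mean": lambda s: s.mean(),
        "row count": lambda s: len(s),
    }
    # The first column names the statistic, mirroring the "metrics" header.
    result = pd.DataFrame({"metrics": list(statistics.keys())})
    for name in df.columns:
        column = df[name]
        values = []
        for label, function in statistics.items():
            # "row count" works for any column; the others need numbers.
            if label == "row count" or pd.api.types.is_numeric_dtype(column):
                values.append(str(function(column)))
            else:
                values.append("-")
        result[name] = values
    return result


summary = summarize(pd.DataFrame({"col1": [1, 2, 1], "col2": ["a", "b", "c"]}))
print(summary)
```

Because every cell is a string, a test against such a table has to compare with strings as well, which is what the updated `test_summary.py` in this commit does.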

1 change: 1 addition & 0 deletions Runtime/safe-ds/safeds/plotting/_correlation_heatmap.py
@@ -29,6 +29,7 @@ def correlation_heatmap(table: Table) -> None:
vmax=1,
xticklabels=table.get_column_names(),
yticklabels=table.get_column_names(),
+ cmap="vlag",
)
plt.tight_layout()
plt.show()
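
For context, a diverging colormap such as `vlag` suits correlation values because they are centered on 0 and bounded by -1 and 1. The sketch below assumes seaborn and matplotlib are installed and works on a pandas `DataFrame` instead of the Safe-DS `Table`; the helper names and the sample data are illustrative and not part of the library.

```python
import pandas as pd


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations of the numeric columns."""
    return df.select_dtypes(include="number").corr()


def plot_correlation_heatmap(df: pd.DataFrame) -> None:
    """Draw the correlations with the diverging 'vlag' colormap (hypothetical helper)."""
    # Imported here so the correlation helper works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = correlation_matrix(df)
    sns.heatmap(
        data=corr,
        vmin=-1,
        vmax=1,
        xticklabels=list(corr.columns),
        yticklabels=list(corr.columns),
        cmap="vlag",
    )
    plt.tight_layout()
    plt.show()


sample = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]})
corr = correlation_matrix(sample)
print(corr)
```

Calling `plot_correlation_heatmap(sample)` would then render the matrix. With `vmin=-1` and `vmax=1` the neutral midpoint of the colormap corresponds to a correlation of 0.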
36 changes: 24 additions & 12 deletions Runtime/safe-ds/tests/data/tabular/_table/test_summary.py
@@ -22,19 +22,31 @@ def test_summary() -> None:
"row count",
],
"col1": [
- 2.0,
- 1.0,
- 4.0 / 3,
- 1.0,
- 1.0,
- 4.0,
- 1.0 / 3,
- table._data[0].std(),
- 2.0 / 3,
- 2.0 / 3,
- 3.0,
+ "2",
+ "1",
+ str(4.0 / 3),
+ "1",
+ "1.0",
+ "4",
+ str(1.0 / 3),
+ str(table._data[0].std()),
+ str(2.0 / 3),
+ str(2.0 / 3),
+ "3",
],
+ "col2": [
+ "-",
+ "-",
+ "-",
+ "a",
+ "-",
+ "-",
+ "-",
+ "-",
+ "1.0",
+ str(1.0 / 3),
+ "3",
+ ],
- "col2": ["-", "-", "-", "a", "-", "-", "-", "-", 1.0, 1.0 / 3, 3],
}
)
)
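
The mix of expected strings such as `"1"`, `"1.0"`, and `str(1.0 / 3)` follows from how Python's `str` formats integers and floats. A short illustration in plain Python, independent of Safe-DS (the variable names are only for this example):

```python
# Integers and floats stringify differently, even when numerically equal.
as_int = str(1)
as_float = str(1.0)

# Building the expectation with str(...) avoids hand-typing a long float literal.
third = str(1.0 / 3)

print(as_int, as_float, third)
```

This is why an expectation for a stringified summary has to match the numeric type of each statistic, not just its value.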
Binary file modified docs/Stdlib/python/Tutorials/Resources/Heatmap.png
Binary file modified docs/Stdlib/python/Tutorials/Resources/Summary.png
48 changes: 31 additions & 17 deletions docs/Stdlib/python/Tutorials/visualization.md
@@ -2,7 +2,8 @@

The following code will use a Jupyter Notebook environment.

- First we need some data to visualize. For this we use the common example of the titanic disaster.
+ ## Table & Statistics
+ First, we need some data to visualize. For this, we use the common example of the titanic disaster.

!!! note
You can download that dataset on [kaggle](https://www.kaggle.com/c/titanic).
@@ -12,33 +13,33 @@ from safeds.data.tabular import Table
data = Table.from_csv("path/to/your/data.csv")
```

- Now we want to have look at what our dataset looks like. For this we use Jupyter Notebooks native display function.
+ Now we want to have a look at what our dataset looks like. For this, we use Jupyter Notebooks native display function.

```python
- data # calls display(data)
+ data    # calls display(data)
```

![Table](./Resources/Table.png)

Next some statistics.

```python
- data.summary() # returns a table with various statistics for each column
+ data.summary()  # returns a table with various statistics for each column
```

![Summary](./Resources/Summary.png)

- As you can see here, the **idness** of the column _PassangerId_ is 1. This means, that every row has a unique value for
+ As you can see here, the **idness** of the column _PassengerId_ is 1. This means, that every row has a unique value for
this column. Since this isn't helpful for our usecase we can drop it.

```python
- data_cleaned = data.drop_columns(["PassangerId"])
+ data_cleaned = data.drop_columns(["PassengerId"])
```
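
As a rough illustration of the statistic, idness can be read as the number of distinct values divided by the number of rows. The sketch below assumes that definition and uses a hypothetical `idness` helper on plain Python lists with made-up values; it does not call the Safe-DS API.

```python
def idness(values: list) -> float:
    """Share of distinct values among all rows (assumed definition)."""
    if not values:
        raise ValueError("idness is undefined for an empty column")
    return len(set(values)) / len(values)


passenger_ids = [1, 2, 3, 4]      # every row is unique
passenger_classes = [1, 3, 3, 2]  # values repeat

print(idness(passenger_ids))      # 1.0
print(idness(passenger_classes))  # 0.75
```

A column with an idness of 1 identifies rows but carries no pattern a plot could reveal, which is the reason for dropping it.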

+ ## Heatmap
Now we have a rough idea of what we are looking at. But we still don't really know a lot about our dataset.
- So next we can start to plot a our columns against each other in a so called Heatmap, to understand which values relate to each other.
+ So next we can start to plot our columns against each other in a so called Heatmap, to understand which values relate to each other.

- But since this type of diagramm only works for numerical values, we are going to use only those.
+ But since this type of diagram only works for numerical values, we are going to use only those.

```python
from safeds.plotting import correlation_heatmap
@@ -49,8 +50,9 @@ correlation_heatmap(data_only_numerics)

![Heatmap](./Resources/Heatmap.png)

- As you can see, the columns _Fare_ and _Pclass_ (Passanger Class) seem to heavily correlate. Let's have another look at that.
- We'll use a linechart to better understand their relationship.
+ As you can see, the columns _Fare_ and _Pclass_ (Passenger Class) seem to heavily correlate. Let's have another look at that.
+ ## Lineplot
+ We'll use a lineplot to better understand their relationship.

```python
from safeds.plotting import lineplot
@@ -61,19 +63,31 @@ lineplot(data_cleaned, "Pclass", "Fare")

The line itself represents the central tendency and the hued area around it a confidence interval for that estimate.
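
To make the shaded band concrete: one common way to obtain such an interval is a percentile bootstrap of the mean. The sketch below assumes that approach, since the tutorial does not state how the plot computes its interval. The fares are made up, and `bootstrap_mean_ci` is a hypothetical helper, not a Safe-DS function.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    # Cut the same number of resamples from both tails.
    cut = round((1 - level) / 2 * n_resamples)
    lower = resampled_means[cut]
    upper = resampled_means[n_resamples - cut - 1]
    return mean(values), lower, upper


# Made-up fares for a single passenger class.
fares = [71.3, 53.1, 51.9, 26.6, 35.5, 263.0, 76.7, 61.2]
center, lower, upper = bootstrap_mean_ci(fares)
print(f"mean={center:.1f}, interval=({lower:.1f}, {upper:.1f})")
```

Repeating this per passenger class would give one center point and one band per x value, which is the shape the line plot draws.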

- We can conclude that tickets for first class rooms are much more expensive compared to second and third class.
- Also the difference between second and third is less pronounced.
+ We can conclude that tickets for first classrooms are much more expensive compared to second and third class.
+ Also, the difference between second and third is less pronounced.

- Some other plots that might be useful are boxplots, histogams and scatterplots.
+ ## Other plots
+ Some other plots that might be useful are boxplots, histograms and scatterplots.

```python
- from safeds.plotting import boxplot, histogram, scatterplot
+ from safeds.plotting import boxplot

boxplot(data_cleaned.get_column("Age"))
- histogram(data_cleaned.get_column("Fare"))
- scatterplot(data_cleaned, "Age", "Fare")
```

![Boxplot](./Resources/Boxplot.png)

+ ```python
+ from safeds.plotting import histogram
+
+ histogram(data_cleaned.get_column("Fare"))
+ ```
![Histogram](./Resources/Histogram.png)

+ ```python
+ from safeds.plotting import scatterplot
+
+ scatterplot(data_cleaned, "Age", "Fare")
+ ```

![Scatterplot](./Resources/Scatterplot.png)
