cases-rki-by-ags.csv - implausible data for entity 8126 - Hohenlohekreis #906

Open
mathiasflick opened this issue Apr 18, 2021 · 9 comments

@mathiasflick

Cases for 2021-04-13: 4502
Cases for 2021-04-14: 4263 (239 fewer than the day before!)
Cases for 2021-04-15: 4306
...

This is not in line with the usual 'corrective action' taken by RKI (which would normally keep continuity)...
Or am I missing something?

Greetings from Cologne
Mathias

@jgehrcke
Owner

Hey Mathias!

> (which would normally keep continuity)...

normally, yes. Haha.

I have seen a few 'data corruptions' on their side that they fixed within a few days. I did not always document these incidents, but here is one example: #608

For now let's maybe wait for a little bit and see if this disappears.

Happy to take a closer look otherwise.

'Data mangling' in the real world is hard, as we can see here. I think they have a pretty robust data pipeline by now, but I am sure there is still potential for human error every day, like fat-fingering something in a spreadsheet. We'll see; of course, the problem might be something else entirely.

Thank you for the report, as always!

@jgehrcke
Owner

jgehrcke commented Apr 20, 2021

Problem still persists today:

$ curl -sO https://raw.githubusercontent.com/jgehrcke/covid-19-germany-gae/master/cases-rki-by-ags.csv && \
    python -c 'import pandas as pd; df=pd.read_csv("cases-rki-by-ags.csv",index_col=["time_iso8601"]); print(df["8126"][-20:])'
time_iso8601
2021-03-31T17:00:00+0000    4050
2021-04-01T17:00:00+0000    4094
2021-04-02T17:00:00+0000    4170
2021-04-03T17:00:00+0000    4178
2021-04-04T17:00:00+0000    4180
2021-04-05T17:00:00+0000    4181
2021-04-06T17:00:00+0000    4190
2021-04-07T17:00:00+0000    4257
2021-04-08T17:00:00+0000    4323
2021-04-09T17:00:00+0000    4358
2021-04-10T17:00:00+0000    4420
2021-04-11T17:00:00+0000    4431
2021-04-12T17:00:00+0000    4442
2021-04-13T17:00:00+0000    4211
2021-04-14T17:00:00+0000    4271
2021-04-15T17:00:00+0000    4315
2021-04-16T17:00:00+0000    4356
2021-04-17T17:00:00+0000    4413
2021-04-18T17:00:00+0000    4425
2021-04-19T17:00:00+0000    4430
Name: 8126, dtype: int64

Non-monotonic between these two:

2021-04-12T17:00:00+0000    4442
2021-04-13T17:00:00+0000    4211
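
A cumulative case count should never decrease, so a drop like this can be found mechanically by looking for negative day-over-day differences. The following is a minimal sketch of that check, not the tooling used in this repository; the helper name `find_drops` is hypothetical, and the values are copied from the output above rather than read from the CSV.

```python
import pandas as pd

# Cumulative counts for AGS 8126, copied from the output above
# (2021-04-10 through 2021-04-15).
series = pd.Series(
    [4420, 4431, 4442, 4211, 4271, 4315],
    index=pd.to_datetime(
        [
            "2021-04-10",
            "2021-04-11",
            "2021-04-12",
            "2021-04-13",
            "2021-04-14",
            "2021-04-15",
        ]
    ),
    name="8126",
)


def find_drops(cumulative: pd.Series) -> pd.Series:
    """Return the day-over-day differences that are negative.

    Hypothetical helper: a cumulative count must be non-decreasing,
    so every negative difference marks a non-monotonic step.
    """
    diffs = cumulative.diff().dropna()
    return diffs[diffs < 0]


drops = find_drops(series)
print(drops)
```

For the values above this reports a single drop, dated 2021-04-13, which is the step from 4442 to 4211.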

@jgehrcke
Owner

OK. Found the artifact in the primary source, too:

On the BW state website for COVID-19 there is a link to this spreadsheet:

https://sozialministerium.baden-wuerttemberg.de/fileadmin/redaktion/m-sm/intern/downloads/Downloads_Gesundheitsschutz/Tabelle_Coronavirus-Faelle-BW.xlsx

Screenshot from 2021-04-20 16-07-25

I downloaded and inspected it just now (April 20, 2021, 16:00 local time) and found:

Screenshot from 2021-04-20 16-06-54

There, the non-monotonicity is between April 16 and April 17.

@jgehrcke
Owner

Well. The actual primary source seems to be https://www.corona-im-hok.de/ and there is a link to this ArcGIS system: https://lra-hok.maps.arcgis.com/apps/dashboards/d770da3ea38643bbbd662a3e05bad5a9

When I play with that dashboard, look up the cumulative case count curve, and zoom in to the interesting date range, I do not see the artifact:

Screenshot from 2021-04-20 16-12-47

The red line seems to be what we're looking for. It has the following data points:

April 12: 4508
April 13: 4568
April 14: 4628
April 15: 4670 
April 16: 4711 
April 17: 4755

@jgehrcke
Owner

jgehrcke commented Apr 20, 2021

Did leave a note with the BW state via contact form. I doubt this will even reach the right desk. : )

The data for Hohenlohekreis in https://sozialministerium.baden-wuerttemberg.de/fileadmin/redaktion/m-sm/intern/downloads/Downloads_Gesundheitsschutz/Tabelle_Coronavirus-Faelle-BW.xlsx appear to be wrong around April 16.

April 16: 4634
April 17: 4398

The cumulative value must not decrease in this direction of time. That is surely a typo, isn't it?

The district itself reports (per the data at https://www.corona-im-hok.de/):

April 12: 4508
April 13: 4568
April 14: 4628
April 15: 4670
April 16: 4711
April 17: 4755

More discussion of the problem: https://github.com/jgehrcke/covid-19-germany-gae/issues/906#issuecomment-823305444

@mathiasflick
Author

Very good findings, indeed!
I am a little concerned that RKI apparently derives its figures for total count and 7-day incidence not from a single timeline but from at least two, which are not necessarily consistent with each other. Otherwise their 7-day incidence would be corrupted (like the one in my - and your :-) - (geo)graphical representation).

Screenshot_2021-04-20 seven-day-incidence-by-ags - Jupyter Notebook(1)
Screenshot_2021-04-20 seven-day-incidence-by-ags - Jupyter Notebook

This is what the RKI dashboard infobox says:
Screenshot_2021-04-20 RKI COVID-19 Germany

The 7-day incidence is plausible according to your trustworthy source. The total number of cases is in line with the (most likely wrong) timeline of total cases!
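
To illustrate why the incidence would be corrupted if it were derived from the same cumulative timeline, here is a small sketch that computes a 7-day incidence as the 7-day increase in cumulative cases per 100,000 inhabitants. This is an assumed, simplified formula and not necessarily how RKI computes its figure; the population value is a rough assumption for illustration only. The cumulative values are the ones from the CSV output earlier in this thread.

```python
import pandas as pd

# Cumulative case counts for AGS 8126, copied from the CSV output earlier
# in this thread (2021-03-31 through 2021-04-19).
cumulative = pd.Series(
    [4050, 4094, 4170, 4178, 4180, 4181, 4190, 4257, 4323, 4358,
     4420, 4431, 4442, 4211, 4271, 4315, 4356, 4413, 4425, 4430],
    index=pd.date_range("2021-03-31", periods=20, freq="D"),
)

# ASSUMPTION: rough population figure, used for illustration only.
POPULATION = 112_000

# New cases within the trailing 7 days: today's cumulative count minus the
# cumulative count 7 days earlier. The first 7 entries are NaN.
new_cases_7d = cumulative - cumulative.shift(7)

# ASSUMPTION: simplified incidence formula (cases per 100,000 inhabitants).
incidence_7d = new_cases_7d / POPULATION * 100_000

print(incidence_7d.dropna().round(1))
```

With these inputs, the 7-day case increase falls from 261 on 2021-04-12 to 21 on 2021-04-13, so an incidence derived this way collapses on the day of the drop and stays depressed for the following week.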

Greetings from Cologne
Mathias

@jgehrcke
Owner

> Did leave a note with the BW state via contact form. I doubt this will even reach the right desk. : )

I got a reply that my message was forwarded to the appropriate department :)

@mathiasflick
Author

Curiosity is every engineer’s best asset ...
Motivated by this discussion, I have built a little script to systematically check the respective CSV for non-monotonicity.
This is what I found: there are 88 issues for 75 out of 401 entities (excluding "Berlin details" and 'sum_cases').
Most of them are minor (e.g. a deviation of -1), but with the threshold set to 10 (reporting all deviations of 11 or more), this is the result (sorry for the bad formatting, though):

AGS    Entity                       Date        Deviation
5116   Mönchengladbach, Stadt       2021-03-14   -32
5754   Gütersloh                    2021-03-14  -165
8111   Stuttgart, Stadtkreis        2020-05-03   -13
8126   Hohenlohekreis               2021-04-13  -229
8226   Rhein-Neckar-Kreis           2021-03-08   -24
9362   Regensburg                   2020-12-18   -13
9563   Fürth                        2021-03-14   -25
9573   Fürth                        2021-03-14   -28
9678   Schweinfurt                  2021-03-14   -87
10041  Regionalverband Saarbrücken  2021-03-12   -67
12065  Oberhavel                    2020-05-03   -12
15082  Anhalt-Bitterfeld            2021-03-14   -16
16053  Jena, Stadt                  2021-04-13   -11
16053  Jena, Stadt                  2021-04-16   -22
16075  Saale-Orla-Kreis             2021-03-14   -48

With a threshold of 10, there are 15 issues across 14 of the 401 entities!
Besides the fact that our "Hohenlohe case" is indeed the most significant one, something obviously happened on 2021-03-14 (which accounts for seven of the 15 issues found with threshold 10 ...).
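
The script itself is not shown in this thread. A check of this kind could look roughly like the sketch below, which scans every column of a frame shaped like `cases-rki-by-ags.csv` and reports drops larger than a threshold. The function name `find_monotonicity_issues`, its parameters, and the toy data (including the AGS `9999`) are hypothetical.

```python
import pandas as pd


def find_monotonicity_issues(df, threshold=0, skip=("sum_cases",)):
    """Return (column, timestamp, deviation) for every drop beyond threshold.

    Hypothetical helper. With threshold=10, only deviations of -11 or
    lower are reported, matching the convention described above.
    """
    issues = []
    for col in df.columns:
        if col in skip:
            continue
        diffs = df[col].diff().dropna()
        # Keep only the negative steps that exceed the threshold.
        for ts, delta in diffs[diffs < -threshold].items():
            issues.append((col, ts, int(delta)))
    return issues


# Toy data: the 8126 values come from the output earlier in this thread;
# column "9999" is a made-up entity with a minor -1 deviation.
df = pd.DataFrame(
    {
        "8126": [4431, 4442, 4211, 4271],
        "9999": [100, 99, 105, 110],
        "sum_cases": [4531, 4541, 4316, 4381],
    },
    index=pd.to_datetime(
        ["2021-04-11", "2021-04-12", "2021-04-13", "2021-04-14"]
    ),
)

all_issues = find_monotonicity_issues(df)
big_issues = find_monotonicity_issues(df, threshold=10)

for col, ts, delta in big_issues:
    print(col, ts.date(), delta)
```

To run this against the real file, the frame could be loaded the same way as in the command earlier in this thread, with `pd.read_csv("cases-rki-by-ags.csv", index_col=["time_iso8601"])`, and any further columns to ignore passed via `skip`.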

Greetings from Cologne
Mathias

@mathiasflick
Author

Some changes during the last few days (unfortunately not for the better ...)!
Output of my little script using a threshold of 10; new/modified issues are highlighted in bold:

AGS    Entity                       Date        Deviation
5116   Mönchengladbach, Stadt       2021-03-14   -32
5754   Gütersloh                    2021-03-14  -165
8111   Stuttgart, Stadtkreis        2020-05-03   -13
8126   Hohenlohekreis               2021-04-13  -229
8226   Rhein-Neckar-Kreis           2021-03-08   -24
9362   Regensburg                   2020-12-18   -13
9471   Bamberg                      2021-05-03   -12
9563   Fürth                        2021-03-14   -25
9564   Nürnberg                     2021-05-03  -124
9573   Fürth                        2021-03-14   -28
9662   Schweinfurt                  2021-05-03   -16
9678   Schweinfurt                  2021-03-14   -87
9678   Schweinfurt                  2021-05-03   -16
10041  Regionalverband Saarbrücken  2021-03-12   -67
12065  Oberhavel                    2020-05-03   -12
15082  Anhalt-Bitterfeld            2021-03-14   -16
16053  Jena, Stadt                  2021-04-13   -51
16075  Saale-Orla-Kreis             2021-03-14   -48

With a threshold of 10, there are 18 issues across 17 of the 401 entities!
The deviations sum to 978 cases in total.

Since Sunday (2021-05-09), 168 cases have been added (with a threshold of 10); the sum was 810 as of Sunday (2021-05-09).

Greetings from Cologne
Mathias
