
Anomaly detection: round 2 #93

Open
kathsherratt opened this issue Feb 17, 2021 · 5 comments

@kathsherratt
Contributor

Comments by @seabbs

I think the missing piece of the puzzle is to go back to doing some anomaly correction, but in a less hardcore fashion than we did last time. Previously, we corrected all of the data (both the data used for fitting and the data used when plotting) and it led us a little astray (we thought we were doing well but never saw the truth data, so we never knew how we were actually doing).
Adding a second, anomaly-cleaned data stream and using that for fitting, whilst keeping the current truth data everywhere else, seems like a good option.
In terms of anomaly detection, something fairly light seems sensible. Perhaps just having an allowed week-to-week change (i.e. Monday to Monday, perhaps in the order of 200%) and setting the value to the backwards-looking 7-day average if it exceeds this?
The other critical thing we didn't have before was some awareness of how much and when we are doing this, so flagging that and perhaps adding it to the summary report seems like it would be really useful.
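
A minimal sketch of that week-to-week rule in R, assuming a daily series in a data frame with date and cases columns; the column names, the reading of "200%" as a proportional change greater than 2, and the fallback are illustrative guesses rather than anything already in the repo:

    # Flag points whose week-on-week (same weekday, e.g. Monday vs previous
    # Monday) increase exceeds the allowed change, and fall back to the
    # backwards-looking 7-day average. Assumes at least 8 rows of daily data.
    correct_weekly_anomalies <- function(dat, max_change = 2) {
      dat <- dat[order(dat$date), ]
      dat$corrected <- dat$cases
      for (i in 8:nrow(dat)) {
        prev_week <- dat$cases[i - 7]
        trailing_mean <- mean(dat$cases[(i - 7):(i - 1)], na.rm = TRUE)
        change <- (dat$cases[i] - prev_week) / prev_week
        if (is.finite(change) && is.finite(trailing_mean) && change > max_change) {
          dat$corrected[i] <- trailing_mean
        }
      }
      dat
    }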

@kathsherratt
Contributor Author

kathsherratt commented Feb 17, 2021

I guess this involves:

  • Check data anomalies
    • Questions:
      • check for weekly anomalies, daily anomalies, or both?
      • correct with the average (mean? median?) of the last 7 days (see the sketch after this list)
    • Files to update:
      • get-us-data.R
      • report.Rmd - flag states with anomalies.
  • Use "corrected" data in model fitting and ensembling
    • Questions:
      • presumably use "corrected" data in all models?
      • are we fitting each model to both sets of data and comparing; or fitting to the "corrected" data only?
    • Files to update:
      • models/rt/update-rt.R
      • models/timeseries/update-timeseries.R
      • models/deaths-conv-cases/update-conv.R
      • Ensembling already uses whichever data was used for fitting the Rt model, so no update is needed
  • Use "truth" data in plotting
    • evaluation/ensembles.R
    • evaluation/models.R
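
A rough sketch of the daily option referenced in the list above, keeping the raw series untouched and adding a corrected column, so fitting can use cases_corrected while plotting and evaluation keep using cases. Column names and the deviation ratio are placeholders, not the pipeline's current interface:

    # Compare each day to the trailing 7-day mean; flag large deviations and
    # keep both the raw and corrected series side by side.
    flag_daily_anomalies <- function(dat, max_ratio = 3) {
      dat <- dat[order(dat$date), ]
      dat$trailing_mean <- NA_real_
      for (i in 8:nrow(dat)) {
        dat$trailing_mean[i] <- mean(dat$cases[(i - 7):(i - 1)], na.rm = TRUE)
      }
      dat$anomaly <- !is.na(dat$cases) & !is.na(dat$trailing_mean) &
        dat$trailing_mean > 0 & dat$cases > max_ratio * dat$trailing_mean
      dat$cases_corrected <- ifelse(dat$anomaly, dat$trailing_mean, dat$cases)
      dat
    }

report.Rmd could then filter on the anomaly column to list the flagged states and dates, and the model update scripts would read cases_corrected in place of cases.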

@seabbs
Contributor

seabbs commented Feb 17, 2021

Nice work Kath,

Some thoughts:

  • I think checking for daily anomalies makes the most sense, as it should hopefully make problems easier to detect and eyeball.
  • Mean I think - not ideal but 🤷
  • Fitting to just corrected data
  • Truth data also (obviously) in submission/report.Rmd
  • In finalise.R it would be nice to save a dated table of flags that can then be rendered as a table in report.Rmd (rough sketch below)
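
One possible shape for that dated flag table; the function name, folder, and columns are assumptions for illustration, not the current finalise.R interface:

    # Write the anomaly flags produced on a given run to a dated csv that
    # report.Rmd can later read and render as a table.
    save_anomaly_flags <- function(flags, dir = "data/anomalies") {
      # flags: a data frame with e.g. state, date, raw_value, corrected_value
      dir.create(dir, showWarnings = FALSE, recursive = TRUE)
      flags$flagged_on <- Sys.Date()
      out_file <- file.path(dir, paste0(Sys.Date(), "-anomaly-flags.csv"))
      write.csv(flags, out_file, row.names = FALSE)
      invisible(out_file)
    }

report.Rmd could then read the most recent file (or bind all of them) and display it, e.g. with knitr::kable().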

@kathsherratt
Contributor Author

Nothing new, but just dropping in here some useful resources for manual data sense-checks (not sure where else to keep this):
https://github.com/nytimes/covid-19-data/issues?q=is%3Aissue+label%3Adata-issue+
https://github.com/CSSEGISandData/COVID-19/issues

@kathsherratt
Contributor Author

kathsherratt commented Mar 5, 2021

Flagging this function specifically for checking a range of methods for anomaly detection (used by the Reich Lab on US data):
https://github.com/reichlab/covidData/blob/master/R/identify_outliers.R
https://github.com/reichlab/covidData/blob/master/vignettes/outliers.R

Also, linking to #97, which looks to me like a near duplicate and expansion of this issue.

@kathsherratt pinned this issue May 10, 2021
@kathsherratt
Contributor Author

Pinning this issue. We will need to:

  • load raw data and save this csv to a clearly named data-raw (or similar) folder
  • add a simple anomaly handling function which compares a value to both its lead and lagged values (sketched below)
  • save the corrected csv into a separate data-modified (or similar) folder
  • then fit models to the modified data, with all evaluation plotted against the raw data
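
A minimal sketch of that lead/lag comparison and the raw/modified split, assuming a daily data frame with date and cases columns; the threshold and the data-raw / data-modified paths are illustrative:

    # Flag values far above (or below) the mean of their immediate neighbours
    # and replace them with that neighbour mean.
    handle_anomalies <- function(dat, tolerance = 5) {
      dat <- dat[order(dat$date), ]
      lead_val <- c(dat$cases[-1], NA)
      lag_val  <- c(NA, dat$cases[-nrow(dat)])
      neighbour_mean <- rowMeans(cbind(lead_val, lag_val), na.rm = TRUE)
      anomaly <- !is.na(dat$cases) & is.finite(neighbour_mean) & neighbour_mean > 0 &
        (dat$cases > tolerance * neighbour_mean | dat$cases < neighbour_mean / tolerance)
      dat$cases[anomaly] <- neighbour_mean[anomaly]
      dat
    }

    # Illustrative workflow: the untouched csv lives in data-raw/, the corrected
    # one in data-modified/; models are fit to the modified data and evaluation
    # is plotted against the raw data.
    raw <- read.csv("data-raw/cases.csv")
    modified <- handle_anomalies(raw)
    write.csv(modified, "data-modified/cases.csv", row.names = FALSE)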
