
Analyze dataset edge case failure #45

Closed
paxcema opened this issue Dec 7, 2023 · 7 comments

@paxcema
Contributor

paxcema commented Dec 7, 2023

Triggered by the attached dataset:

    analyze_dataset(df)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/insights.py", line 39, in analyze_dataset
    stats = statistical_analysis(df, type_information.dtypes, args, type_information.identifiers)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/insights.py", line 61, in statistical_analysis
    df = cleaner(data, dtypes, args.get('pct_invalid', 0),
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 42, in cleaner
    data = _remove_columns(data, identifiers, target, mode, timeseries_settings,
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 379, in _remove_columns
    data = _rm_rows_w_empty_targets(data, target)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 310, in _rm_rows_w_empty_targets
    df = df.dropna(subset=[target])
  File "/Users/MindsDB/env/lib/python3.9/site-packages/pandas/core/frame.py", line 6407, in dropna
    raise KeyError(np.array(subset)[check].tolist())
KeyError: ['saledate']

analyze_dataset_failure.csv
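
For reference, the final KeyError is simply what pandas raises when dropna(subset=...) names a column that is not present in the frame. A minimal sketch of that failure mode (the surviving columns below are only an assumption about what the frame looks like once 'saledate' is absent):

import pandas as pd

# Hypothetical frame missing the target column 'saledate', e.g. because the
# attached CSV is prediction output rather than the raw house_sales data.
df = pd.DataFrame({"ma": [1.0, 2.0], "bedrooms": [2, 3]})

# dropna() with a subset column that does not exist raises a KeyError naming
# 'saledate', matching the traceback above.
df.dropna(subset=["saledate"])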

@paxcema paxcema added the bug Something isn't working label Dec 7, 2023
@pedrofluxa
Contributor

Hi there,

The file analyze_dataset_failure.csv does not seem to contain a dataset, but rather something that looks like the output of running a prediction (using Lightwood, I presume). After some detective work, I am fairly sure the data comes from the house_sales dataset, which is included in MindsDB as a tutorial.

I followed the tutorial (with very minor changes) and did not get any error messages. These are the commands I ran:

CREATE DATABASE example_db
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "3.220.66.106",
    "port": "5432",
    "database": "demo"
};

SELECT * 
FROM example_db.demo_data.house_sales;

SELECT saledate, ma, type, bedrooms 
FROM example_db.demo_data.house_sales 
WHERE type='house' 
AND bedrooms=3;

CREATE MODEL 
  mindsdb.house_sales_model_demo
FROM example_db
  (SELECT * FROM demo_data.house_sales)
PREDICT ma
ORDER BY saledate
GROUP BY bedrooms, type
-- as the target is quarterly, we will look back two years to forecast the next one
WINDOW 8
HORIZON 4; 

SELECT m.saledate as date, m.ma as forecast
  FROM mindsdb.house_sales_model_demo as m 
  JOIN example_db.demo_data.house_sales as t
  WHERE t.saledate > '2019-01-01' AND t.type = 'house'
  AND t.bedrooms=2;

Mind the 2019-01-01 filter; it is there to ensure that the prediction output starts before the first prediction in the file analyze_dataset_failure.csv, whose timestamps are:

2019-04-02
2019-07-03
2019-10-03
2020-01-03

What seems odd to me is that '2019-04-02' doesn't match the input frequency of the data. All entries correspond to quarterly results, so the first entry of a year falls on 03-31, the second on 06-30, the third on 09-30, and the last on 12-31; a prediction with timestamp 2019-04-02 therefore doesn't quite make sense.

Interestingly enough

  • the time difference between 2019-04-02 and 2019-03-31 is two days
  • 2019-07-03 and 2019-06-30 -> 3 days
  • 2019-10-03 and 2019-09-30 -> 3 days
  • 2020-01-03 and 2019-12-31 -> 3 days

What's even more interesting is that the time difference between consecutive predictions is exactly 92 days, which is certainly not "a quarter": 92 × 4 = 368, which happens to be 365 + 3!
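
A quick way to see the uneven spacing (plain pandas, nothing MindsDB-specific):

import pandas as pd

# Calendar quarter ends are 90-92 days apart, so stepping forward by a fixed
# 92-day delta (as the predictions above appear to do) slowly drifts off the
# true quarter boundaries and picks up the extra days over a year.
q_ends = pd.to_datetime(["2019-03-31", "2019-06-30", "2019-09-30",
                         "2019-12-31", "2020-03-31"])
print(pd.Series(q_ends).diff().dropna())  # 91, 92, 92, 91 days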

Another interesting "feature" is that the SQL commands listed above yield predictions for

2019-03-30 (should be 2019-03-31)
2019-06-29 (should be 2019-06-30)
2019-09-29 (should be 2019-09-30)
2019-12-30 (should be 2019-12-31)

but for 2020

2020-03-31 (correct)
2020-07-01 (should be 2020-06-30)
2020-10-01 (should be 2020-09-30)

In summary:

  • I cannot reproduce the error without knowing the exact dataset and the exact commands used to trigger it.
  • Efforts to reproduce it show that there is something odd with the way predictions are generated. It is not clear what causes this behavior, but I presume it is related to the series not being evenly sampled (calendar quarters are not equally spaced in time, and leap years add an extra day).
  • Both the data submitted by OP and the predictions from a fresh MindsDB installation are technically wrong; not terribly wrong, but off by one to three days.

I suggest opening a PR linked to this issue that fixes the problem by properly handling unevenly sampled data, either via resampling or via time-stamp matching (i.e. ignoring the offsets produced by the uneven sampling); a rough sketch of the latter follows below.
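
As an illustration only of the time-stamp matching idea (assuming the house_sales column names from the tutorial; not the actual fix):

import pandas as pd

# Snap every date onto the end of its calendar quarter, so small offsets such
# as 2019-04-02 vs. 2019-03-31 disappear before/after forecasting.
df = pd.read_csv("analyze_dataset_failure.csv", parse_dates=["saledate"])
df["saledate"] = (df["saledate"]
                  .dt.to_period("Q")
                  .dt.to_timestamp(how="end")
                  .dt.normalize())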

@paxcema
Contributor Author

paxcema commented Dec 8, 2023

What a great analysis @pedrofluxa 🚀! Apologies for not including additional information here; it could have saved you a bunch of time.

Even though your diagnosis is correct, I think it is best to move it to a different issue, as the original report is more about the KeyError and the subsequent failure in dataprep_ml itself than about the generated predictions.

@pedrofluxa
Contributor

No problem: detective work is always fun (and I learned a lot) :D

Got it. I'll investigate how to fix the particular problem described by OP. Would you please send the exact dataset and commands/code that OP used to trigger this error?

@paxcema
Contributor Author

paxcema commented Dec 11, 2023

The exact commands are not available anymore, but:

  1. Your example above is essentially what I did, if memory serves.
  2. It can be replicated by taking the attached dataset, calling type_infer on it to get type_information, and then calling statistical_analysis(df, type_information.dtypes, args, type_information.identifiers); see the sketch below.
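
Roughly, in code (the import paths and the shape of args are my best guess from the traceback, so treat them as assumptions; exact signatures may differ between type_infer/dataprep_ml versions):

import pandas as pd
from type_infer.infer import infer_types               # assumed import path
from dataprep_ml.insights import statistical_analysis

df = pd.read_csv("analyze_dataset_failure.csv")
type_information = infer_types(df, pct_invalid=0)       # assumed signature

# Assumption: args is a plain dict; the traceback shows it being read with
# args.get('pct_invalid', 0), and the cleaner also needs a target column.
args = {"target": "saledate", "pct_invalid": 0}
stats = statistical_analysis(df, type_information.dtypes, args,
                             type_information.identifiers)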

@paxcema
Contributor Author

paxcema commented Dec 11, 2023

Scratch that, I managed to find it 😁:

CREATE MODEL mindsdb.house_sales_model
FROM example_db
  (SELECT * FROM demo_data.house_sales)
PREDICT ma
ORDER BY saledate
GROUP BY bedrooms, type
WINDOW 8
HORIZON 4;

DESCRIBE house_sales_model;

SELECT m.*
  FROM mindsdb.house_sales_model as m 
  JOIN example_db.demo_data.house_sales as t
  WHERE t.saledate > '2019-1-1' AND t.type = 'house'
  AND t.bedrooms=2;

@pedrofluxa
Contributor

Nice!

I am not able to reproduce the error using a clean installation of MindsDB. Instead, I actually get these results back:

saledate ma type bedrooms __mdb_ts_previous_ma select_data_query when_data ma_original ma_timestep_1 ma_timestep_2 ma_timestep_3 ma_confidence ma_explain ma_anomaly ma_min ma_max
2019-03-31 00:00:00.000000 497311.59375 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 497311.59375, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 496542.375, "confidence_upper_bound": 498080.8125} false 496542.375 498080.8125
2019-06-30 00:00:00.000000 504434.4375 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 504434.4375, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 503665.21875, "confidence_upper_bound": 505203.65625} false 503665.21875 505203.65625
2019-09-30 00:00:00.000000 503483.71875 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 503483.71875, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 502714.5, "confidence_upper_bound": 504252.9375} false 502714.5 504252.9375
2019-12-31 00:00:00.000000 494577.75 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 494577.75, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 493206.28125, "confidence_upper_bound": 495949.21875} [NULL] 493206.28125 495949.21875
2020-04-01 00:00:00.000000 496013.46875 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 496013.46875, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 489713.46875, "confidence_upper_bound": 502313.46875} [NULL] 489713.46875 502313.46875
2020-07-02 00:00:00.000000 506423.15625 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 506423.15625, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 485278.640625, "confidence_upper_bound": 527567.671875} [NULL] 485278.640625 527567.671875
2020-10-02 00:00:00.000000 504858.125 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 504858.125, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 483713.609375, "confidence_upper_bound": 526002.640625} [NULL] 483713.609375 526002.640625

I am not entirely sure whether the missing values for ma_timestep_{1,2,3} are a normal feature or point to something not working correctly.

@paxcema
Contributor Author

paxcema commented May 7, 2024

Ok, closing then 🤝

@paxcema paxcema closed this as completed May 7, 2024