
Analyze dataset edge case failure #45

Closed
paxcema opened this issue Dec 7, 2023 · 7 comments

@paxcema
Contributor

paxcema commented Dec 7, 2023

Triggered by the attached dataset:

    analyze_dataset(df)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/insights.py", line 39, in analyze_dataset
    stats = statistical_analysis(df, type_information.dtypes, args, type_information.identifiers)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/insights.py", line 61, in statistical_analysis
    df = cleaner(data, dtypes, args.get('pct_invalid', 0),
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 42, in cleaner
    data = _remove_columns(data, identifiers, target, mode, timeseries_settings,
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 379, in _remove_columns
    data = _rm_rows_w_empty_targets(data, target)
  File "/Users/MindsDB/dataprep_ml/dataprep_ml/cleaners.py", line 310, in _rm_rows_w_empty_targets
    df = df.dropna(subset=[target])
  File "/Users/MindsDB/env/lib/python3.9/site-packages/pandas/core/frame.py", line 6407, in dropna
    raise KeyError(np.array(subset)[check].tolist())
KeyError: ['saledate']

analyze_dataset_failure.csv
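
For reference, the final KeyError is simply what pandas raises when dropna(subset=...) names a column that is not present in the frame. A minimal sketch of that failure mode (the surviving columns below are only an assumption about what the frame looks like once 'saledate' is absent):

import pandas as pd

# Hypothetical frame missing the target column 'saledate', e.g. because the
# attached CSV is prediction output rather than the raw house_sales data.
df = pd.DataFrame({"ma": [1.0, 2.0], "bedrooms": [2, 3]})

# dropna() with a subset column that does not exist raises a KeyError naming
# 'saledate', matching the traceback above.
df.dropna(subset=["saledate"])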

@paxcema paxcema added the bug Something isn't working label Dec 7, 2023
@pedrofluxa
Contributor

Hi there,

The file analyze_dataset_failure.csv does not seem to contain a dataset, but rather something that looks like the output of running a prediction (using Lightwood, I presume). After some detective work, I am fairly sure the data comes from the house_sales dataset, which is included in MindsDB as a tutorial.

I followed the tutorial (with very minor changes) and did not get any error messages. These are the commands I ran:

CREATE DATABASE example_db
WITH ENGINE = "postgres",
PARAMETERS = {
    "user": "demo_user",
    "password": "demo_password",
    "host": "3.220.66.106",
    "port": "5432",
    "database": "demo"
};

SELECT * 
FROM example_db.demo_data.house_sales;

SELECT saledate, ma, type, bedrooms 
FROM example_db.demo_data.house_sales 
WHERE type='house' 
AND bedrooms=3;

CREATE MODEL 
  mindsdb.house_sales_model_demo
FROM example_db
  (SELECT * FROM demo_data.house_sales)
PREDICT ma
ORDER BY saledate
GROUP BY bedrooms, type
-- as the target is quarterly, we will look back two years to forecast the next one
WINDOW 8
HORIZON 4; 

SELECT m.saledate as date, m.ma as forecast
  FROM mindsdb.house_sales_model_demo as m 
  JOIN example_db.demo_data.house_sales as t
  WHERE t.saledate > '2019-01-01' AND t.type = 'house'
  AND t.bedrooms=2;

Mind the 2019-01-01 filter; it is there to ensure that the prediction output starts before the first prediction in the file analyze_dataset_failure.csv, whose timestamps are:

2019-04-02
2019-07-03
2019-10-03
2020-01-03

What seems odd to me is that '2019-04-02' doesn't match the input frequency of the data. All entries correspond to quarterly results, so the first entry of a year falls on 03-31, the second on 06-30, the third on 09-30, and the last on 12-31; a prediction with timestamp 2019-04-02 therefore doesn't quite make sense.

Interestingly enough

  • the time difference between 2019-04-02 and 2019-03-31 is two days
  • 2019-07-03 and 2019-06-30 -> 3 days
  • 2019-10-03 and 2019-09-30 -> 3 days
  • 2020-01-03 and 2019-12-31 -> 3 days

What's even more interesting is that the time difference between consecutive predictions is exactly 92 days, which is certainly not "a quarter": 92 × 4 = 368, which happens to be 365 + 3!
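
A quick way to see the uneven spacing (plain pandas, nothing MindsDB-specific):

import pandas as pd

# Calendar quarter ends are 90-92 days apart, so stepping forward by a fixed
# 92-day delta (as the predictions above appear to do) slowly drifts off the
# true quarter boundaries and picks up the extra days over a year.
q_ends = pd.to_datetime(["2019-03-31", "2019-06-30", "2019-09-30",
                         "2019-12-31", "2020-03-31"])
print(pd.Series(q_ends).diff().dropna())  # 91, 92, 92, 91 days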

Another interesting "feature" is that the SQL commands listed above yield predictions for

2019-03-30 (should be 2019-03-31)
2019-06-29 (should be 2019-06-30)
2019-09-29 (should be 2019-09-30)
2019-12-30 (should be 2019-12-31)

but for 2020

2020-03-31 (correct)
2020-07-01 (should be 2020-06-30)
2020-10-01 (should be 2020-09-30)

In summary:

  • I cannot reproduce the error without knowing the exact dataset and the exact commands used to trigger it.
  • Efforts to reproduce it show that there is something odd with the way predictions are generated. It is not clear what causes this behavior, but I presume it is related to the series not being evenly sampled (calendar quarters are not equally spaced in time, and leap years add an extra day).
  • Both the data submitted by OP and the predictions from a fresh MindsDB installation are technically wrong; not terribly wrong, but off by one to three days.

I suggest opening a PR linked to this issue that fixes the problem by properly handling unevenly sampled data, either via resampling or via time-stamp matching (i.e. ignoring the offsets produced by the uneven sampling); a rough sketch of the latter follows below.
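
As an illustration only of the time-stamp matching idea (assuming the house_sales column names from the tutorial; not the actual fix):

import pandas as pd

# Snap every date onto the end of its calendar quarter, so small offsets such
# as 2019-04-02 vs. 2019-03-31 disappear before/after forecasting.
df = pd.read_csv("analyze_dataset_failure.csv", parse_dates=["saledate"])
df["saledate"] = (df["saledate"]
                  .dt.to_period("Q")
                  .dt.to_timestamp(how="end")
                  .dt.normalize())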

@paxcema
Contributor Author

paxcema commented Dec 8, 2023

What a great analysis @pedrofluxa 🚀! Apologies for not including additional information here; it could have saved you a bunch of time.

Even though your diagnosis is correct, I think it is best to move it to a different issue, as the original report is more about the KeyError and the subsequent failure in dataprep_ml itself than about the generated predictions.

@pedrofluxa
Contributor

No problem: detective work is always fun (and I learned a lot) :D

Got it. I'll investigate how to fix the particular problem described by OP. Would you please send the exact dataset and commands/code that OP used to trigger this error?

@paxcema
Contributor Author

paxcema commented Dec 11, 2023

The exact commands are not available anymore, but:

  1. Your example above is essentially what I did, if memory serves.
  2. It can be replicated by taking the attached dataset, calling type_infer on it to get type_information, and then calling statistical_analysis(df, type_information.dtypes, args, type_information.identifiers); see the sketch below.
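
Roughly, in code (the import paths and the shape of args are my best guess from the traceback, so treat them as assumptions; exact signatures may differ between type_infer/dataprep_ml versions):

import pandas as pd
from type_infer.infer import infer_types               # assumed import path
from dataprep_ml.insights import statistical_analysis

df = pd.read_csv("analyze_dataset_failure.csv")
type_information = infer_types(df, pct_invalid=0)       # assumed signature

# Assumption: args is a plain dict; the traceback shows it being read with
# args.get('pct_invalid', 0), and the cleaner also needs a target column.
args = {"target": "saledate", "pct_invalid": 0}
stats = statistical_analysis(df, type_information.dtypes, args,
                             type_information.identifiers)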

@paxcema
Contributor Author

paxcema commented Dec 11, 2023

Scratch that, I managed to find it 😁:

CREATE MODEL mindsdb.house_sales_model
FROM example_db
  (SELECT * FROM demo_data.house_sales)
PREDICT ma
ORDER BY saledate
GROUP BY bedrooms, type
WINDOW 8
HORIZON 4;

DESCRIBE house_sales_model;

SELECT m.*
  FROM mindsdb.house_sales_model as m 
  JOIN example_db.demo_data.house_sales as t
  WHERE t.saledate > '2019-1-1' AND t.type = 'house'
  AND t.bedrooms=2;

@pedrofluxa
Contributor

Nice!

I am not able to reproduce the error using a clean installation of MindsDB. Instead, I actually get these results back:

saledate ma type bedrooms __mdb_ts_previous_ma select_data_query when_data ma_original ma_timestep_1 ma_timestep_2 ma_timestep_3 ma_confidence ma_explain ma_anomaly ma_min ma_max
2019-03-31 00:00:00.000000 497311.59375 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 497311.59375, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 496542.375, "confidence_upper_bound": 498080.8125} false 496542.375 498080.8125
2019-06-30 00:00:00.000000 504434.4375 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 504434.4375, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 503665.21875, "confidence_upper_bound": 505203.65625} false 503665.21875 505203.65625
2019-09-30 00:00:00.000000 503483.71875 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 503483.71875, "confidence": 0.9991, "anomaly": false, "truth": null, "confidence_lower_bound": 502714.5, "confidence_upper_bound": 504252.9375} false 502714.5 504252.9375
2019-12-31 00:00:00.000000 494577.75 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 494577.75, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 493206.28125, "confidence_upper_bound": 495949.21875} [NULL] 493206.28125 495949.21875
2020-04-01 00:00:00.000000 496013.46875 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 496013.46875, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 489713.46875, "confidence_upper_bound": 502313.46875} [NULL] 489713.46875 502313.46875
2020-07-02 00:00:00.000000 506423.15625 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 506423.15625, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 485278.640625, "confidence_upper_bound": 527567.671875} [NULL] 485278.640625 527567.671875
2020-10-02 00:00:00.000000 504858.125 house 2 [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] [NULL] 0.9991 {"predicted_value": 504858.125, "confidence": 0.9991, "anomaly": null, "truth": null, "confidence_lower_bound": 483713.609375, "confidence_upper_bound": 526002.640625} [NULL] 483713.609375 526002.640625

I am not entirely sure whether the missing values for ma_timestep_{1,2,3} are a normal feature or point to something not working correctly.

@paxcema
Contributor Author

paxcema commented May 7, 2024

Ok, closing then 🤝

@paxcema paxcema closed this as completed May 7, 2024