fix(csv): Do not coerce persisted data integer columns to float #20760

john-bodley · 2022-07-19T03:20:55Z

SUMMARY

Regrettably #20151 wasn't sufficient if the result set was stored prior to downloading the CSV file. More specifically, Pandas coerces an integer array containing None to a float, likely because of the NumPy coercion, i.e.,

>>> pd.DataFrame.from_records([{"foo": 1}, {"foo": None}])
   foo
0  1.0
1  NaN

The fix is to explicitly define the dtype, using the standard DataFrame constructor, i.e.,

>>> pd.DataFrame(data=[{"foo": 1}, {"foo": None}], dtype=object)
    foo
0     1
1  None
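
A minimal sketch of how the two constructors affect the exported CSV. The `rows` payload and its column names are made up for illustration and are not taken from the Superset code base; only `pd.DataFrame.from_records`, `pd.DataFrame`, and `DataFrame.to_csv` are real Pandas calls.

```python
import pandas as pd

# Hypothetical persisted result set: an integer column that contains a NULL.
rows = [{"id": 1, "foo": 1}, {"id": 2, "foo": None}]

# from_records infers a float64 column because of the missing value.
coerced = pd.DataFrame.from_records(rows)

# Forcing dtype=object keeps the original Python ints and None.
preserved = pd.DataFrame(data=rows, dtype=object)

coerced_csv = coerced.to_csv(index=False)
preserved_csv = preserved.to_csv(index=False)

print(coerced.dtypes["foo"], coerced_csv.splitlines())
print(preserved.dtypes["foo"], preserved_csv.splitlines())
```

With the inferred dtype the integer is written as `1.0`, whereas the object column writes it as `1`; the missing value becomes an empty field in both cases.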

Long term we should probably replace quirky Pandas with PyArrow globally.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

CI.

ADDITIONAL INFORMATION

Has associated issue:
Required feature flags:
Changes UI
Includes DB Migration (follow approval process in SIP-59)
- Migration is atomic, supports rollback & is backwards-compatible
- Confirm DB migration upgrade and downgrade tested
- Runtime estimates and downtime expectations provided
Introduces new feature or API
Removes existing feature or API

codecov · 2022-07-19T03:27:06Z

Codecov Report

Merging #20760 (4d2f439) into master (e60083b) will decrease coverage by 11.48%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master   #20760       +/-   ##
===========================================
- Coverage   66.35%   54.87%   -11.49%     
===========================================
  Files        1754     1754               
  Lines       66689    66688        -1     
  Branches     7049     7049               
===========================================
- Hits        44253    36595     -7658     
- Misses      20639    28296     +7657     
  Partials     1797     1797

Flag	Coverage Δ
hive	`53.23% <0.00%> (+<0.01%)`	⬆️
mysql	`?`
postgres	`?`
presto	`53.09% <0.00%> (+<0.01%)`	⬆️
python	`58.00% <0.00%> (-23.69%)`	⬇️
sqlite	`?`
unit	`50.57% <0.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
superset/views/core.py	`34.46% <0.00%> (-43.43%)`	⬇️
superset/utils/dashboard_import_export.py	`0.00% <0.00%> (-100.00%)`	⬇️
superset/key_value/commands/update.py	`0.00% <0.00%> (-88.89%)`	⬇️
superset/key_value/commands/delete.py	`0.00% <0.00%> (-85.30%)`	⬇️
superset/key_value/commands/delete_expired.py	`0.00% <0.00%> (-80.77%)`	⬇️
superset/dashboards/commands/importers/v0.py	`15.62% <0.00%> (-76.25%)`	⬇️
superset/datasets/commands/update.py	`25.30% <0.00%> (-68.68%)`	⬇️
superset/datasets/commands/create.py	`29.41% <0.00%> (-68.63%)`	⬇️
superset/datasets/commands/importers/v0.py	`24.03% <0.00%> (-67.45%)`	⬇️
superset/reports/commands/execute.py	`24.45% <0.00%> (-67.16%)`	⬇️
... and 275 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

…he#20760) * Replace pd.DataFrame.from_records with pd.DataFrame * Remove unused code * Update core.py * Update core.py * Update csv.py * Update core.py (cherry picked from commit e1fd906)

mbcsa · 2022-07-29T17:27:00Z

Hi @john-bodley

This fix introduces a new problem when a user exports a CSV file from a cached query.
I've created a new issue, #20919.

The thing is, when the DataFrame is created dynamically from cached data, it does not respect column formats.
This is a problem when the decimal separator is configured via CSV_EXPORT ("sep" attribute).

I'm testing this, and it works well when changing:

df = pd.DataFrame(
    data=obj["data"],
    dtype=object,
    columns=[c["name"] for c in obj["columns"]],
)

to

df = pd.DataFrame(
    data=obj["data"],
    columns=[c["name"] for c in obj["columns"]],
)

Thank you
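
A rough reproduction of the behaviour reported above, under stated assumptions: the `rows` payload, the column names, and the `export_kwargs` values are hypothetical stand-ins for a cached result set and a `CSV_EXPORT` setting, and the sketch assumes those settings end up as keyword arguments to `DataFrame.to_csv`.

```python
import pandas as pd

# Hypothetical cached payload and column list (not from the Superset code base).
rows = [{"amount": 1.5, "qty": 2}]
columns = ["amount", "qty"]

# Assumed CSV_EXPORT-like settings: semicolon delimiter, comma as decimal mark.
export_kwargs = {"sep": ";", "decimal": ",", "index": False}

# dtype=object: values stay generic Python objects, so the float formatter
# that applies the decimal mark is not used for them.
as_object = pd.DataFrame(data=rows, dtype=object, columns=columns)

# Inferred dtypes: "amount" becomes float64 and honours the decimal mark.
inferred = pd.DataFrame(data=rows, columns=columns)

object_csv = as_object.to_csv(**export_kwargs)
inferred_csv = inferred.to_csv(**export_kwargs)

print(object_csv.splitlines())
print(inferred_csv.splitlines())
```

The trade-off is that dropping `dtype=object` restores number formatting but brings back the integer-to-float coercion for columns that contain missing values.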

john-bodley added 2 commits July 18, 2022 20:12

Replace pd.DataFrame.from_records with pd.DataFrame

cb7d25c

Remove unused code

2915836

pull-request-size bot added the size/XS label Jul 19, 2022

john-bodley changed the title ~~John bodley fix 20151~~ fix(csv): Do not coerce persisted data integer columns to float Jul 19, 2022

john-bodley added 2 commits July 18, 2022 20:36

Update core.py

c6110c7

Update core.py

2a4686e

pull-request-size bot added size/S and removed size/XS labels Jul 19, 2022

john-bodley requested review from betodealmeida and ktmud July 19, 2022 04:21

ktmud approved these changes Jul 19, 2022

Update csv.py

9ca2df2

pull-request-size bot added size/XS and removed size/S labels Jul 19, 2022

Update core.py

4d2f439

john-bodley merged commit e1fd906 into master Jul 19, 2022

mbcsa mentioned this pull request Jul 29, 2022

Export to CSV with cached data is converted to Panda Dataframe without column formats #20919

Closed

3 tasks

mistercrunch added the 🏷️ bot and 🚢 2.1.0 labels Mar 13, 2024

mistercrunch deleted the john-bodley--fix-20151 branch March 26, 2024 16:13
