PERF: Improve performance of StataReader's processing of missing values #25780

bashtage · 2019-03-19T11:35:28Z

Improve performance of StataReader when converting columns
with missing values

closes #N/A
tests added / passed N/A, perf only
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback · 2019-03-19T12:34:29Z

can you add an asv that covers this case?

jreback · 2019-03-19T12:34:59Z

pandas/io/stata.py

+        if replacements:
+            columns = data.columns
+            replacements = DataFrame(replacements)
+            data.drop(replacements.columns, 1, inplace=True)


don't use inplace

I used inplace intentionally to try and minimize memory consumption, which can be an issue when loading some Stata files. Is it going away?

its not idiomatic, nor does it actually do anything here (it rarely does); we don't use it internally if at all possible.

pep8speaks · 2019-03-19T13:00:11Z

Hello @bashtage! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-20 13:16:12 UTC

jreback · 2019-03-19T18:18:35Z

can you merge master

codecov · 2019-03-19T23:02:32Z

Codecov Report

Merging #25780 into master will decrease coverage by 49.51%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #25780       +/-   ##
===========================================
- Coverage   91.26%   41.74%   -49.52%     
===========================================
  Files         172      172               
  Lines       52965    52965               
===========================================
- Hits        48337    22109    -26228     
- Misses       4628    30856    +26228

Flag	Coverage Δ
#multiple	`?`
#single	`41.74% <ø> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.36%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-95.46%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.17%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.15%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
pandas/io/json/normalize.py	`8.16% <0%> (-88.78%)`	⬇️
... and 129 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e54e55...2eda26b. Read the comment docs.

codecov · 2019-03-19T23:02:35Z

Codecov Report

Merging #25780 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #25780   +/-   ##
=======================================
  Coverage   91.27%   91.27%           
=======================================
  Files         173      173           
  Lines       53002    53002           
=======================================
  Hits        48375    48375           
  Misses       4627     4627

Flag	Coverage Δ
#multiple	`89.83% <ø> (ø)`	⬆️
#single	`41.77% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37d04a3...e5f3c06. Read the comment docs.

jreback · 2019-03-19T23:25:34Z

lgtm. ping on green.

bashtage · 2019-03-20T09:56:36Z

@jreback I think the failures are unrelated to my changes.

jreback · 2019-03-20T12:17:02Z

master is passing, so maybe you have an older build. pls merge master and let's see.

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage · 2019-03-20T14:06:21Z

@jreback Green.

jreback · 2019-03-20T14:06:55Z

thanks @bashtage does this fully close the issue?

bashtage · 2019-03-20T14:20:33Z

No, there is still an issue with duplicate valuelables which can't be handeled by categories which require unique labels. There are 2 solutions to that: 1. mung the value label and preserve the data (an integer) or 2. assign all values with the same label to be the same. 1 is probably better since there isn't much loss of fidelity (can just find the munged names and fix, plus the underlying integer values are still correct for the labeled categories). But this needs more thought and work.

* upstream/master: (55 commits) PERF: Improve performance of StataReader (pandas-dev#25780) Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784) BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588) BUG-24971 copying blocks also considers ndim (pandas-dev#25521) CLN: Panel reference from documentation (pandas-dev#25649) ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955) BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769) DOC: clean bug fix section in whatsnew (pandas-dev#25792) DOC: Fixed PeriodArray api ref (pandas-dev#25526) Move locale code out of tm, into _config (pandas-dev#25757) Unpin pycodestyle (pandas-dev#25789) Add test for rdivmod on EA array (GH23287) (pandas-dev#24047) ENH: Support datetime.timezone objects (pandas-dev#25065) Cython language level 3 (pandas-dev#24538) API: concat on sparse values (pandas-dev#25719) TST: assert_produces_warning works with filterwarnings (pandas-dev#25721) make core.config self-contained (pandas-dev#25613) CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721) TST: Check pytables<3.5.1 when skipping (pandas-dev#25773) DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770) ...

jreback added Performance Memory or execution speed performance IO Stata read_stata, to_stata labels Mar 19, 2019

jreback requested changes Mar 19, 2019

View reviewed changes

bashtage force-pushed the stata-missing-perf branch from cc999d9 to a279af4 Compare March 19, 2019 13:00

bashtage force-pushed the stata-missing-perf branch from a279af4 to ba7232c Compare March 19, 2019 13:01

bashtage force-pushed the stata-missing-perf branch 2 times, most recently from b303857 to 2eda26b Compare March 19, 2019 23:02

bashtage force-pushed the stata-missing-perf branch from 2eda26b to a6aaba0 Compare March 19, 2019 23:12

jreback added this to the 0.25.0 milestone Mar 19, 2019

jreback approved these changes Mar 19, 2019

View reviewed changes

bashtage force-pushed the stata-missing-perf branch from a6aaba0 to 36496c1 Compare March 20, 2019 08:23

bashtage force-pushed the stata-missing-perf branch from 36496c1 to e2bd4dd Compare March 20, 2019 12:28

PERF: Improve performance of StataReader

e5f3c06

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage force-pushed the stata-missing-perf branch from e2bd4dd to e5f3c06 Compare March 20, 2019 13:16

jreback merged commit 85c3f82 into pandas-dev:master Mar 20, 2019

bashtage deleted the stata-missing-perf branch March 21, 2019 13:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PERF: Improve performance of StataReader's processing of missing values #25780

PERF: Improve performance of StataReader's processing of missing values #25780

bashtage commented Mar 19, 2019

jreback commented Mar 19, 2019

jreback Mar 19, 2019

bashtage Mar 19, 2019

jreback Mar 19, 2019 •

edited

Loading

pep8speaks commented Mar 19, 2019 •

edited

Loading

jreback commented Mar 19, 2019

codecov bot commented Mar 19, 2019

codecov bot commented Mar 19, 2019 •

edited

Loading

jreback commented Mar 19, 2019

bashtage commented Mar 20, 2019

jreback commented Mar 20, 2019

bashtage commented Mar 20, 2019

jreback commented Mar 20, 2019

bashtage commented Mar 20, 2019

PERF: Improve performance of StataReader's processing of missing values #25780

PERF: Improve performance of StataReader's processing of missing values #25780

Conversation

bashtage commented Mar 19, 2019

jreback commented Mar 19, 2019

jreback Mar 19, 2019

Choose a reason for hiding this comment

bashtage Mar 19, 2019

Choose a reason for hiding this comment

jreback Mar 19, 2019 • edited Loading

Choose a reason for hiding this comment

pep8speaks commented Mar 19, 2019 • edited Loading

Comment last updated at 2019-03-20 13:16:12 UTC

jreback commented Mar 19, 2019

codecov bot commented Mar 19, 2019

Codecov Report

codecov bot commented Mar 19, 2019 • edited Loading

Codecov Report

jreback commented Mar 19, 2019

bashtage commented Mar 20, 2019

jreback commented Mar 20, 2019

bashtage commented Mar 20, 2019

jreback commented Mar 20, 2019

bashtage commented Mar 20, 2019

jreback Mar 19, 2019 •

edited

Loading

pep8speaks commented Mar 19, 2019 •

edited

Loading

codecov bot commented Mar 19, 2019 •

edited

Loading