Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Improve performance of StataReader's processing of missing values #25780

Merged
merged 1 commit into from
Mar 20, 2019

Conversation

bashtage
Copy link
Contributor

Improve performance of StataReader when converting columns
with missing values

xref #25772

  • closes #N/A
  • tests added / passed N/A, perf only
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@jreback jreback added Performance Memory or execution speed performance IO Stata read_stata, to_stata labels Mar 19, 2019
@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

can you add an asv that covers this case?

if replacements:
columns = data.columns
replacements = DataFrame(replacements)
data.drop(replacements.columns, 1, inplace=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use inplace

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used inplace intentionally to try and minimize memory consumption, which can be an issue when loading some Stata files. Is it going away?

Copy link
Contributor

@jreback jreback Mar 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not idiomatic, nor does it actually do anything here (it rarely does); we don't use it internally if at all possible.

@bashtage bashtage force-pushed the stata-missing-perf branch from cc999d9 to a279af4 Compare March 19, 2019 13:00
@pep8speaks
Copy link

pep8speaks commented Mar 19, 2019

Hello @bashtage! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-20 13:16:12 UTC

@bashtage bashtage force-pushed the stata-missing-perf branch from a279af4 to ba7232c Compare March 19, 2019 13:01
@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

can you merge master

@bashtage bashtage force-pushed the stata-missing-perf branch 2 times, most recently from b303857 to 2eda26b Compare March 19, 2019 23:02
@codecov
Copy link

codecov bot commented Mar 19, 2019

Codecov Report

Merging #25780 into master will decrease coverage by 49.51%.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff             @@
##           master   #25780       +/-   ##
===========================================
- Coverage   91.26%   41.74%   -49.52%     
===========================================
  Files         172      172               
  Lines       52965    52965               
===========================================
- Hits        48337    22109    -26228     
- Misses       4628    30856    +26228
Flag Coverage Δ
#multiple ?
#single 41.74% <ø> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/io/formats/latex.py 0% <0%> (-100%) ⬇️
pandas/io/sas/sas_constants.py 0% <0%> (-100%) ⬇️
pandas/tseries/plotting.py 0% <0%> (-100%) ⬇️
pandas/tseries/converter.py 0% <0%> (-100%) ⬇️
pandas/io/formats/html.py 0% <0%> (-99.36%) ⬇️
pandas/core/groupby/categorical.py 0% <0%> (-95.46%) ⬇️
pandas/io/sas/sas7bdat.py 0% <0%> (-91.17%) ⬇️
pandas/io/sas/sas_xport.py 0% <0%> (-90.15%) ⬇️
pandas/core/tools/numeric.py 10.44% <0%> (-89.56%) ⬇️
pandas/io/json/normalize.py 8.16% <0%> (-88.78%) ⬇️
... and 129 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e54e55...2eda26b. Read the comment docs.

@codecov
Copy link

codecov bot commented Mar 19, 2019

Codecov Report

Merging #25780 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #25780   +/-   ##
=======================================
  Coverage   91.27%   91.27%           
=======================================
  Files         173      173           
  Lines       53002    53002           
=======================================
  Hits        48375    48375           
  Misses       4627     4627
Flag Coverage Δ
#multiple 89.83% <ø> (ø) ⬆️
#single 41.77% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37d04a3...e5f3c06. Read the comment docs.

@bashtage bashtage force-pushed the stata-missing-perf branch from 2eda26b to a6aaba0 Compare March 19, 2019 23:12
@jreback jreback added this to the 0.25.0 milestone Mar 19, 2019
@jreback
Copy link
Contributor

jreback commented Mar 19, 2019

lgtm. ping on green.

@bashtage bashtage force-pushed the stata-missing-perf branch from a6aaba0 to 36496c1 Compare March 20, 2019 08:23
@bashtage
Copy link
Contributor Author

@jreback I think the failures are unrelated to my changes.

@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

master is passing, so maybe you have an older build. pls merge master and let's see.

@bashtage bashtage force-pushed the stata-missing-perf branch from 36496c1 to e2bd4dd Compare March 20, 2019 12:28
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
@bashtage bashtage force-pushed the stata-missing-perf branch from e2bd4dd to e5f3c06 Compare March 20, 2019 13:16
@bashtage
Copy link
Contributor Author

@jreback Green.

@jreback jreback merged commit 85c3f82 into pandas-dev:master Mar 20, 2019
@jreback
Copy link
Contributor

jreback commented Mar 20, 2019

thanks @bashtage does this fully close the issue?

@bashtage
Copy link
Contributor Author

No, there is still an issue with duplicate valuelables which can't be handeled by categories which require unique labels. There are 2 solutions to that: 1. mung the value label and preserve the data (an integer) or 2. assign all values with the same label to be the same. 1 is probably better since there isn't much loss of fidelity (can just find the munged names and fix, plus the underlying integer values are still correct for the labeled categories). But this needs more thought and work.

thoo added a commit to thoo/pandas that referenced this pull request Mar 20, 2019
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  ...
@bashtage bashtage deleted the stata-missing-perf branch March 21, 2019 13:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants