PERF: Improve performance of StataReader's processing of missing values #25780
can you add an asv that covers this case?
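For context, a benchmark of the kind requested might look roughly like the sketch below. It is not the benchmark that was added to pandas: the class name, column names, sizes, and the use of a temporary directory are all assumptions made for illustration. It relies only on the asv convention that `setup` and `teardown` run around each `time_*` method.

```python
import os
import tempfile

import numpy as np
import pandas as pd


class StataMissingSketch:
    """Hypothetical asv-style benchmark: read a .dta file whose columns contain missing values."""

    N = 10_000  # rows (assumed size, chosen for illustration)
    K = 10      # columns that contain missing values

    def setup(self):
        rng = np.random.default_rng(0)
        values = rng.standard_normal((self.N, self.K))
        # Turn roughly half of the entries into missing values.
        values[values < 0] = np.nan
        df = pd.DataFrame(
            values, columns=[f"missing_{i}" for i in range(self.K)]
        )
        self.tmpdir = tempfile.mkdtemp()
        self.fname = os.path.join(self.tmpdir, "missing.dta")
        df.to_stata(self.fname, write_index=False)

    def time_read_stata(self):
        # The timed call: reading triggers the reader's missing-value handling.
        pd.read_stata(self.fname)

    def teardown(self):
        os.remove(self.fname)
        os.rmdir(self.tmpdir)
```

Keeping the file creation in `setup` means only the read is timed.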
pandas/io/stata.py
if replacements:
    columns = data.columns
    replacements = DataFrame(replacements)
    data.drop(replacements.columns, 1, inplace=True)
don't use inplace
I used inplace intentionally to try and minimize memory consumption, which can be an issue when loading some Stata files. Is it going away?
it's not idiomatic, nor does it actually do anything here (it rarely does); we don't use it internally if at all possible.
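A minimal sketch of the non-inplace pattern the reviewer is asking for, assuming `replacements` is a dict mapping column names to replacement Series. The helper name `apply_replacements` and the sample data are hypothetical, and this is not necessarily the code that was merged.

```python
import pandas as pd


def apply_replacements(data, replacements):
    """Return a new frame with the replaced columns, keeping the original column order."""
    if not replacements:
        return data
    columns = data.columns
    new_cols = pd.DataFrame(replacements, index=data.index)
    # drop() without inplace returns a new frame and leaves `data` untouched.
    remaining = data.drop(columns=new_cols.columns)
    # Re-attach the replaced columns, then restore the original ordering.
    return pd.concat([remaining, new_cols], axis=1)[columns]


# Hypothetical data: 101 and 102 stand in for missing-value codes.
data = pd.DataFrame({"a": [1, 101, 3], "b": [4.0, 5.0, 6.0], "c": [7, 8, 102]})
replacements = {
    "a": pd.Series([1.0, float("nan"), 3.0]),
    "c": pd.Series([7.0, 8.0, float("nan")]),
}
result = apply_replacements(data, replacements)
```

Rebinding the result, rather than mutating `data`, keeps the caller's frame unchanged.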
can you merge master
Codecov Report
@@             Coverage Diff             @@
##           master   #25780      +/-    ##
===========================================
- Coverage   91.26%   41.74%   -49.52%
===========================================
  Files         172      172
  Lines       52965    52965
===========================================
- Hits        48337    22109   -26228
- Misses       4628    30856   +26228
Codecov Report
@@           Coverage Diff           @@
##           master   #25780   +/-   ##
=======================================
  Coverage   91.27%   91.27%
=======================================
  Files         173      173
  Lines       53002    53002
=======================================
  Hits        48375    48375
  Misses       4627     4627
lgtm. ping on green.
@jreback I think the failures are unrelated to my changes.
master is passing, so maybe you have an older build. pls merge master and let's see.
Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772
@jreback Green.
thanks @bashtage. does this fully close the issue?
No, there is still an issue with duplicate value labels, which can't be handled by categories because categories require unique labels. There are 2 solutions to that: 1. munge the value label and preserve the data (an integer) or 2. assign all values with the same label to be the same. 1 is probably better since there isn't much loss of fidelity (you can just find the munged names and fix them, plus the underlying integer values are still correct for the labeled categories). But this needs more thought and work.
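To make the two options concrete, here is a rough pure-Python sketch, assuming value labels arrive as a dict mapping integer values to label strings. Both function names and the `"label (value)"` suffix scheme are hypothetical; nothing here reflects what pandas actually does.

```python
from collections import Counter


def munge_duplicate_labels(value_labels):
    """Option 1: keep every integer value, making repeated labels unique.

    A repeated label gets its integer value appended. This hypothetical
    scheme does not guard against a munged name colliding with an
    existing label.
    """
    counts = Counter(value_labels.values())
    return {
        value: f"{label} ({value})" if counts[label] > 1 else label
        for value, label in value_labels.items()
    }


def merge_duplicate_labels(value_labels):
    """Option 2: map every value that shares a label onto the smallest such value."""
    first_value = {}
    mapping = {}
    for value, label in sorted(value_labels.items()):
        first_value.setdefault(label, value)
        mapping[value] = first_value[label]
    return mapping


labels = {1: "low", 2: "high", 3: "high"}
munged = munge_duplicate_labels(labels)
merged = merge_duplicate_labels(labels)
```

With option 1 the integers 2 and 3 stay distinct and only the labels change; with option 2 the value 3 is folded into 2, so the distinction between them is lost.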
Improve performance of StataReader when converting columns
with missing values
xref #25772
git diff upstream/master -u -- "*.py" | flake8 --diff