Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

can't read large stata file #25772

Closed
bdemeshev opened this issue Mar 19, 2019 · 8 comments · Fixed by #25968
Closed

can't read large stata file #25772

bdemeshev opened this issue Mar 19, 2019 · 8 comments · Fixed by #25968
Labels
IO Stata read_stata, to_stata
Milestone

Comments

@bdemeshev
Copy link

bdemeshev commented Mar 19, 2019

I am trying to read the panel dataset of Russian individuals in stata format. The dataset can be freely obtained at the rlms site.

ind_dta = pd.read_stata('USER_RLMS-HSE_IND_1994_2017_v2_eng.dta')

This results in memory error and that seems strange. Machine has 16gb of memory, the file is less than 4gb.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the IO Stata read_stata, to_stata label Mar 19, 2019
@gfyoung
Copy link
Member

gfyoung commented Mar 19, 2019

cc @bashtage

@bdemeshev : Can you post the traceback for this?

@gfyoung gfyoung added Needs Info Clarification about behavior needed to assess issue and removed Needs Info Clarification about behavior needed to assess issue labels Mar 19, 2019
@bdemeshev
Copy link
Author

Have updated all the packages. The error has changed.

traceback


ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _do_convert_categoricals(self, data, value_label_dict, lbllist, order_categoricals)
1710 try:
-> 1711 cat_data.categories = categories
1712 except ValueError:

~/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py in categories(self, categories)
426 def categories(self, categories):
--> 427 new_dtype = CategoricalDtype(categories, ordered=self.ordered)
428 if (self.dtype.categories is not None and

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in init(self, categories, ordered)
225 def init(self, categories=None, ordered=None):
--> 226 self._finalize(categories, ordered, fastpath=False)
227

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in _finalize(self, categories, ordered, fastpath)
334 categories = self.validate_categories(categories,
--> 335 fastpath=fastpath)
336

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
513 if not categories.is_unique:
--> 514 raise ValueError('Categorical categories must be unique')
515

ValueError: Categorical categories must be unique

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
----> 1 d = pd.read_stata("USER_RLMS-HSE_IND_1994_2017_v2_eng.dta")

~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
190 else:
191 try:
--> 192 data = reader.read()
193 finally:
194 reader.close()

~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals)
1593 self.value_label_dict,
1594 self.lbllist,
-> 1595 order_categoricals)
1596
1597 if not preserve_dtypes:

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _do_convert_categoricals(self, data, value_label_dict, lbllist, order_categoricals)
1717 'unique. The repeated labels are:\n'
1718 '{repeats}'
-> 1719 .format(col=col, repeats=repeats))
1720 # TODO: is the next line needed above in the data(...) method?
1721 cat_data = Series(cat_data, index=data.index)

ValueError: Value labels for column I4 are not unique. The repeated labels are:

--------------------------------------------------------------------------------european
tsumadin

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@bashtage
Copy link
Contributor

Seems to require a log-in to get the data file. Can you share it by some other means?

@bdemeshev
Copy link
Author

Yes it requires login, but the registration is completely free. I think it will not be an abuse if I share it temporary for a couple of hours.

@bashtage
Copy link
Contributor

I have downloaded it. Takes a long time to load in pandas.

One hint that the file is not strictly valid: in Stata you get

(label K8_1 already defined)
(label K8_2 already defined)

when loading the data. Doesn't mean it should crash necessarily.

@bashtage
Copy link
Contributor

I have taken a look and the major isue is in replacing missing values. This data set has many columns with missing columns. ~2600 out of 2700. These are mostly integer columns, often byte (1 byte) which don't require much storage. Converting these requires casting the values to doubles which requires 8 bytes/entry. This effectively blows up the dataset by a factor if 4ish (some columns have larger types), which makes it impractically big. I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.

A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.

For now, you can use the lower level StataReader and not convert missing values (you will need to handle them your self). That will get you past at least one problem.

@bdemeshev
Copy link
Author

Ok! Thanks for pointing the low level interface! Will try it :)

@bashtage
Copy link
Contributor

The other issue is that the labels are not unique. That is, 2 values in stata are getting the same lable. Pandas categoricals don't support this. A work around:

from pandas.io.stata import StataReader
file_name = r'C:\temp\USER_RLMS-HSE_IND_1994_2017_v2_eng.dta'
sr = StataReader(file_name, convert_missing=False, chunksize=1000, convert_categoricals=False)
labels = sr.value_labels()  # To use later
for block in sr:
    temp = block
    break

You will then have to apply labels yourself, if you need them.

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
@jreback jreback added this to the 0.25.0 milestone Mar 19, 2019
bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019
Improve performance of StataReader when converting columns
with missing values

xref pandas-dev#25772
jreback pushed a commit that referenced this issue Mar 20, 2019
Improve performance of StataReader when converting columns
with missing values

xref #25772
bashtage added a commit to bashtage/pandas that referenced this issue Apr 2, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Apr 2, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Apr 3, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Apr 4, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes pandas-dev#25772
bashtage added a commit to bashtage/pandas that referenced this issue Apr 4, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes pandas-dev#25772
jreback pushed a commit that referenced this issue Apr 5, 2019
Improve the explanation when value labels are repeated in Stata dta files.
Add suggested methods to workaround the issue using the low level
interface.

closes #25772
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants