can't read large stata file #25772

bdemeshev · 2019-03-19T05:34:31Z

I am trying to read the panel dataset of Russian individuals in stata format. The dataset can be freely obtained at the rlms site.

ind_dta = pd.read_stata('USER_RLMS-HSE_IND_1994_2017_v2_eng.dta')

This results in memory error and that seems strange. Machine has 16gb of memory, the file is less than 4gb.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

gfyoung · 2019-03-19T06:01:26Z

cc @bashtage

@bdemeshev : Can you post the traceback for this?

bdemeshev · 2019-03-19T06:55:07Z

Have updated all the packages. The error has changed.

traceback

ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _do_convert_categoricals(self, data, value_label_dict, lbllist, order_categoricals)
1710 try:
-> 1711 cat_data.categories = categories
1712 except ValueError:

~/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py in categories(self, categories)
426 def categories(self, categories):
--> 427 new_dtype = CategoricalDtype(categories, ordered=self.ordered)
428 if (self.dtype.categories is not None and

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in init(self, categories, ordered)
225 def init(self, categories=None, ordered=None):
--> 226 self._finalize(categories, ordered, fastpath=False)
227

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in _finalize(self, categories, ordered, fastpath)
334 categories = self.validate_categories(categories,
--> 335 fastpath=fastpath)
336

~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
513 if not categories.is_unique:
--> 514 raise ValueError('Categorical categories must be unique')
515

ValueError: Categorical categories must be unique

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
in
----> 1 d = pd.read_stata("USER_RLMS-HSE_IND_1994_2017_v2_eng.dta")

~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
190 else:
191 try:
--> 192 data = reader.read()
193 finally:
194 reader.close()

~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
186 else:
187 kwargs[new_arg_name] = new_arg_value
--> 188 return func(*args, **kwargs)
189 return wrapper
190 return _deprecate_kwarg

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals)
1593 self.value_label_dict,
1594 self.lbllist,
-> 1595 order_categoricals)
1596
1597 if not preserve_dtypes:

~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _do_convert_categoricals(self, data, value_label_dict, lbllist, order_categoricals)
1717 'unique. The repeated labels are:\n'
1718 '{repeats}'
-> 1719 .format(col=col, repeats=repeats))
1720 # TODO: is the next line needed above in the data(...) method?
1721 cat_data = Series(cat_data, index=data.index)

ValueError: Value labels for column I4 are not unique. The repeated labels are:

--------------------------------------------------------------------------------european
tsumadin

`pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

bashtage · 2019-03-19T07:58:13Z

Seems to require a log-in to get the data file. Can you share it by some other means?

bdemeshev · 2019-03-19T08:12:33Z

Yes it requires login, but the registration is completely free. I think it will not be an abuse if I share it temporary for a couple of hours.

bashtage · 2019-03-19T09:14:27Z

I have downloaded it. Takes a long time to load in pandas.

One hint that the file is not strictly valid: in Stata you get

(label K8_1 already defined)
(label K8_2 already defined)

when loading the data. Doesn't mean it should crash necessarily.

bashtage · 2019-03-19T10:13:53Z

I have taken a look and the major isue is in replacing missing values. This data set has many columns with missing columns. ~2600 out of 2700. These are mostly integer columns, often byte (1 byte) which don't require much storage. Converting these requires casting the values to doubles which requires 8 bytes/entry. This effectively blows up the dataset by a factor if 4ish (some columns have larger types), which makes it impractically big. I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.

A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.

For now, you can use the lower level StataReader and not convert missing values (you will need to handle them your self). That will get you past at least one problem.

bdemeshev · 2019-03-19T10:57:19Z

Ok! Thanks for pointing the low level interface! Will try it :)

bashtage · 2019-03-19T11:02:10Z

The other issue is that the labels are not unique. That is, 2 values in stata are getting the same lable. Pandas categoricals don't support this. A work around:

from pandas.io.stata import StataReader
file_name = r'C:\temp\USER_RLMS-HSE_IND_1994_2017_v2_eng.dta'
sr = StataReader(file_name, convert_missing=False, chunksize=1000, convert_categoricals=False)
labels = sr.value_labels()  # To use later
for block in sr:
    temp = block
    break

You will then have to apply labels yourself, if you need them.

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

Improve performance of StataReader when converting columns with missing values xref #25772

Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to workaround the issue using the low level interface. closes pandas-dev#25772

Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to workaround the issue using the low level interface. closes #25772

gfyoung added the IO Stata read_stata, to_stata label Mar 19, 2019

gfyoung added Needs Info Clarification about behavior needed to assess issue and removed Needs Info Clarification about behavior needed to assess issue labels Mar 19, 2019

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

8e27a4b

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

cc999d9

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage mentioned this issue Mar 19, 2019

PERF: Improve performance of StataReader's processing of missing values #25780

Merged

4 tasks

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

a279af4

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

ba7232c

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

b303857

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

2eda26b

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 19, 2019

PERF: Improve performance of StataReader

a6aaba0

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

jreback added this to the 0.25.0 milestone Mar 19, 2019

bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019

PERF: Improve performance of StataReader

36496c1

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019

PERF: Improve performance of StataReader

e2bd4dd

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

bashtage added a commit to bashtage/pandas that referenced this issue Mar 20, 2019

PERF: Improve performance of StataReader

e5f3c06

Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772

jreback pushed a commit that referenced this issue Mar 20, 2019

PERF: Improve performance of StataReader (#25780)

85c3f82

Improve performance of StataReader when converting columns with missing values xref #25772

bashtage mentioned this issue Apr 2, 2019

ENH: Improve explanation when erroring on dta files #25968

Merged

4 tasks

jreback closed this as completed in #25968 Apr 5, 2019

bdemeshev mentioned this issue Oct 30, 2019

Can't import big (0.9gb) spss or stata file tidyverse/haven#441

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

can't read large stata file #25772

can't read large stata file #25772

bdemeshev commented Mar 19, 2019 •

edited

Loading

INSTALLED VERSIONS

gfyoung commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

INSTALLED VERSIONS

bashtage commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

bashtage commented Mar 19, 2019

bashtage commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

bashtage commented Mar 19, 2019

can't read large stata file #25772

can't read large stata file #25772

Comments

bdemeshev commented Mar 19, 2019 • edited Loading

Output of pd.show_versions()

INSTALLED VERSIONS

gfyoung commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

traceback

pd.show_versions()

INSTALLED VERSIONS

bashtage commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

bashtage commented Mar 19, 2019

bashtage commented Mar 19, 2019

bdemeshev commented Mar 19, 2019

bashtage commented Mar 19, 2019

bdemeshev commented Mar 19, 2019 •

edited

Loading

Output of `pd.show_versions()`

`pd.show_versions()`