can't read large stata file #25772
Comments
cc @bashtage @bdemeshev: Can you post the traceback for this?
Have updated all the packages. The error has changed.
traceback
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/core/arrays/categorical.py in categories(self, categories)
~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in __init__(self, categories, ordered)
~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in _finalize(self, categories, ordered, fastpath)
~/anaconda3/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py in validate_categories(categories, fastpath)
ValueError: Categorical categories must be unique
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read_stata(filepath_or_buffer, convert_dates, convert_categoricals, encoding, index_col, convert_missing, preserve_dtypes, columns, order_categoricals, chunksize, iterator)
~/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in read(self, nrows, convert_dates, convert_categoricals, index_col, convert_missing, preserve_dtypes, columns, order_categoricals)
~/anaconda3/lib/python3.6/site-packages/pandas/io/stata.py in _do_convert_categoricals(self, data, value_label_dict, lbllist, order_categoricals)
ValueError: Value labels for column I4 are not unique. The repeated labels are:
--------------------------------------------------------------------------------
european
Seems to require a login to get the data file. Can you share it by some other means?
Yes, it requires a login, but registration is completely free. I don't think it would be an abuse if I share it temporarily for a couple of hours.
I have downloaded it. Takes a long time to load in pandas. One hint that the file is not strictly valid: in Stata you get
when loading the data. Doesn't mean it should crash necessarily.
I have taken a look and the major issue is in replacing missing values. This data set has many columns with missing values: ~2600 out of 2700. These are mostly integer columns, often byte (1 byte), which don't require much storage. Converting these requires casting the values to doubles, which take 8 bytes per entry. This effectively blows up the dataset by a factor of roughly 4 (some columns already have larger types), which makes it impractically big.
I suppose the correct solution would be to use an extension type that supports the correct bit width and a missing value. This needs the extension type API to stabilize.
A side problem that is probably worth fixing is that the conversion of missing values is very slow. I did a quick hack that reduced the conversion time by a factor of about 1000.
For now, you can use the lower-level StataReader and not convert missing values (you will need to handle them yourself). That will get you past at least one problem.
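To make the size argument above concrete, here is a minimal NumPy-only sketch. It assumes a single Stata byte column stored as `int8`, where codes above 100 are Stata's missing-value codes (101 is `.`), and mimics the cast to `float64` that is needed to hold `NaN`. The column contents and the variable names are made up for illustration and are not taken from the RLMS file.

```python
import numpy as np

n_rows = 1_000_000

# Hypothetical byte column: valid values 0..99, stored in 1 byte per entry.
raw = (np.arange(n_rows) % 100).astype(np.int8)

# Mark every tenth entry as missing using Stata's byte code for '.' (101).
raw[::10] = 101

# Representing the missing entries as NaN forces a float64 column.
as_float = raw.astype(np.float64)
as_float[raw > 100] = np.nan

blowup = as_float.nbytes / raw.nbytes
print(f"int8: {raw.nbytes} bytes, float64: {as_float.nbytes} bytes, x{blowup:.0f}")
```

For a pure byte column the growth is 8x; the overall factor for a whole file is smaller when some columns already use wider types.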
Ok! Thanks for pointing out the low-level interface! Will try it :)
The other issue is that the labels are not unique. That is, two values in Stata are getting the same label. Pandas categoricals don't support this. A workaround:
You will then have to apply labels yourself, if you need them.
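The snippet originally posted with this comment is not preserved here. As a hedged sketch of what applying labels yourself could look like, the example below writes a tiny stand-in file, reads it with `convert_categoricals=False` to get the raw integer codes, fetches the label dictionaries with `StataReader.value_labels()`, and maps them onto the column. The file contents are invented, and the sketch assumes the label set is named after the column (true for files written by pandas, not guaranteed for files written by Stata).

```python
import os
import tempfile

import pandas as pd
from pandas.io.stata import StataReader

# Tiny stand-in for a labelled Stata column; the values are invented.
df = pd.DataFrame({"I4": pd.Categorical(["russian", "european", "russian"])})
path = os.path.join(tempfile.mkdtemp(), "example.dta")
df.to_stata(path, write_index=False)

with StataReader(path) as reader:
    # Keep the stored integer codes instead of building categoricals.
    codes = reader.read(convert_categoricals=False)
    # {label_set_name: {code: label}}
    labels = reader.value_labels()

# Assumption: the label set shares the column's name.
# Mapping to plain strings works even when two codes share one label.
labelled = codes["I4"].map(labels["I4"])
print(labelled.tolist())
```

Because the result is an ordinary object column rather than a categorical, repeated labels such as the one reported for column I4 do not raise.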
Improve performance of StataReader when converting columns with missing values xref pandas-dev#25772
Improve the explanation when value labels are repeated in Stata dta files. Add suggested methods to workaround the issue using the low level interface. closes pandas-dev#25772
I am trying to read a panel dataset of Russian individuals in Stata format. The dataset can be freely obtained from the RLMS site.
This results in a memory error, which seems strange: the machine has 16 GB of memory and the file is less than 4 GB.
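One general way to keep peak memory down when a file does not fit comfortably is to read only the needed columns, a chunk at a time, using the `columns` and `chunksize` parameters of `read_stata`. This is a sketch on a small invented file, not the call used in this report, and it does not address the missing-value conversion cost discussed in the comments.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Small invented file standing in for a large .dta file.
df = pd.DataFrame(
    {
        "id": np.arange(10, dtype=np.int32),
        "age": np.arange(10, dtype=np.int8),
        "unused": np.zeros(10),
    }
)
path = os.path.join(tempfile.mkdtemp(), "panel.dta")
df.to_stata(path, write_index=False)

chunk_sizes = []
# With chunksize set, read_stata returns an iterable reader instead of a DataFrame.
with pd.read_stata(path, columns=["id", "age"], chunksize=4) as reader:
    for chunk in reader:
        # Process or aggregate each chunk here rather than keeping them all.
        chunk_sizes.append(len(chunk))

print(chunk_sizes)
```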
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None