BUG: still getting UnicodeDecodeError with encoding_errors="ignore" #57569

Open
3 tasks done
dadrake3 opened this issue Feb 22, 2024 · 3 comments
Labels
Bug Needs Triage Issue that has not been reviewed by a pandas team member

dadrake3 commented Feb 22, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import gzip
from io import BytesIO

import pandas as pd

file_path = ...

# Decompress the whole file into memory as raw bytes.
with gzip.open(file_path, "rb") as gz:
    data = gz.read()
encoding = "ascii"
bytes_io = BytesIO(data)
# Let pandas decode the bytes, asking it to ignore undecodable bytes.
df = pd.read_csv(
    bytes_io,
    encoding=encoding,
    encoding_errors="ignore",
    delimiter="|",
    dtype=str,
    on_bad_lines=lambda x: "skip",
    engine="pyarrow",
    keep_default_na=False,
)

Issue Description

To save on RAM, I am trying to let pandas decode my file internally rather than calling data.decode(encoding, ...) myself and passing in a StringIO object.

However, when I do this I get the following error:

  File "/root/conda/envs/main/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1024, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/conda/envs/main/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 624, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/root/conda/envs/main/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1909, in read
    df = self._engine.read()  # type: ignore[attr-defined]
         ^^^^^^^^^^^^^^^^^^^
  File "/root/conda/envs/main/lib/python3.11/site-packages/pandas/io/parsers/arrow_parser_wrapper.py", line 266, in read
    table = pyarrow_csv.read_csv(
            ^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_csv.pyx", line 1261, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1270, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
  File "pyarrow/io.pxi", line 1843, in pyarrow.lib._cb_transform
  File "pyarrow/io.pxi", line 1884, in pyarrow.lib.Transcoder.__call__
  File "/root/conda/envs/main/lib/python3.11/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10356: ordinal not in range(128)

This happens with both encoding_errors='ignore' and encoding_errors='replace'.
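The traceback ends in `encodings/ascii.py` calling `codecs.ascii_decode(input, self.errors)`, so the error handler in effect is whatever the decoder object was constructed with. The sketch below uses only the standard library to show how an incremental ASCII decoder behaves under each handler. Whether the pyarrow transcoder is built without the `encoding_errors` value is an assumption here, not something the traceback confirms. `sample` and `decode_incrementally` are illustrative names.

```python
import codecs

# Hypothetical sample: ASCII text followed by a UTF-8 encoded "©" (0xc2 0xa9).
sample = b"id|name\n1|caf\xc2\xa9\n"


def decode_incrementally(data: bytes, errors: str) -> str:
    """Decode `data` with an incremental ASCII decoder built with `errors`."""
    decoder = codecs.getincrementaldecoder("ascii")(errors=errors)
    return decoder.decode(data, final=True)


# "strict" is the default handler; it raises on the first non-ASCII byte.
try:
    decode_incrementally(sample, "strict")
    strict_outcome = "decoded"
except UnicodeDecodeError as exc:
    strict_outcome = f"UnicodeDecodeError at byte {exc.start}"

ignored = decode_incrementally(sample, "ignore")    # drops the offending bytes
replaced = decode_incrementally(sample, "replace")  # substitutes U+FFFD

print(strict_outcome)
print(repr(ignored))
print(repr(replaced))
```

If a decoder is created without forwarding the requested handler, it falls back to "strict", which would produce the same kind of error as above.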

Expected Behavior

This should behave the same way as the following, ignoring (or replacing) the undecodable bytes in my file instead of raising:

from io import StringIO

string_io = StringIO(data.decode(encoding, errors="replace"))
df = pd.read_csv(string_io, ...)
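One possible way to keep the lenient decoding while avoiding a second full in-memory copy is to let the standard library decode lazily. This is a minimal sketch, assuming a text handle is acceptable to the reader in use. `open_lenient_text` and the sample file are hypothetical, and which pandas engines accept such a handle is not confirmed in this thread.

```python
import gzip
import os
import tempfile


def open_lenient_text(path: str, encoding: str = "ascii", errors: str = "replace"):
    """Return a text handle that decompresses and decodes as it is read."""
    # In text mode, gzip.open wraps the binary stream in an io.TextIOWrapper,
    # so undecodable bytes are handled by `errors` chunk by chunk.
    return gzip.open(path, mode="rt", encoding=encoding, errors=errors, newline="")


# Build a small hypothetical pipe-delimited file containing non-ASCII bytes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv.gz")
    with gzip.open(path, "wb") as gz:
        gz.write(b"id|name\n1|caf\xc2\xa9\n")

    with open_lenient_text(path) as handle:
        # A CSV reader could consume `handle` here instead of read().
        lines = handle.read().splitlines()

print(lines)
```

Passing `errors="ignore"` instead of `"replace"` drops the offending bytes rather than substituting U+FFFD.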

Installed Versions

INSTALLED VERSIONS

commit : f538741
python : 3.11.8.final.0
python-bits : 64
OS : Linux
OS-release : 6.5.0-17-generic
Version : #17-Ubuntu SMP PREEMPT_DYNAMIC Thu Jan 11 14:20:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 2.2.0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.1.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 15.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 2.0.27
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

dadrake3 added the Bug and Needs Triage labels Feb 22, 2024
eshaready (Contributor) commented

take

regieleki commented

take

eshaready (Contributor) commented Apr 13, 2024

Hello! @dadrake3, could you provide the file you used, or a snippet of it? We can't reproduce the bug right now: with our own file that triggers a Unicode decode error, pandas follows the expected encoding_errors="ignore" behavior.
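To share a small reproducer without uploading the whole file, the bytes around each non-ASCII position can be extracted. This is a minimal sketch with hypothetical helper names (`non_ascii_offsets`, `snippet_around`) and made-up sample data. The position in the traceback may be relative to a chunk rather than to the whole file (an assumption), so the sketch scans for offending bytes instead of relying on that number.

```python
def non_ascii_offsets(data: bytes) -> list[int]:
    """Return the offsets of all bytes outside the ASCII range."""
    return [i for i, byte in enumerate(data) if byte >= 0x80]


def snippet_around(data: bytes, position: int, radius: int = 40) -> bytes:
    """Return the bytes within `radius` of `position`, clamped to the data."""
    start = max(0, position - radius)
    return data[start:position + radius + 1]


# Hypothetical stand-in for the decompressed file contents.
sample = b"id|name\n" + b"1|plain\n" * 3 + b"2|caf\xc2\xa9\n"

offsets = non_ascii_offsets(sample)
snippet = snippet_around(sample, offsets[0], radius=6)

print(offsets)
print(snippet)
```

On real data, `sample` would be the `data` bytes read from the gzip file, and the printed snippet could be pasted into the issue after checking it for sensitive content.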
