to_csv() surrogates not allowed #22610
Comments
Forgot to say that the workaround is to make sure you have no Unicode surrogates in your data. In my case this meant that I needed to decode / re-encode a field that came from another library. For example: field = field.encode('utf-8', errors='surrogatepass') |
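A minimal sketch of one way to check for surrogates and scrub them before writing, for anyone who wants something to paste. The helper names `has_surrogates` and `scrub_surrogates` and the sample value are made up for illustration, and swapping each surrogate for `?` is only one possible policy, not necessarily what the original data needed.

```python
def has_surrogates(text: str) -> bool:
    """Return True if text contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def scrub_surrogates(text: str) -> str:
    """Replace each surrogate with '?' by round-tripping through UTF-8.

    Assumption: losing the surrogate is acceptable for your data.
    """
    return text.encode("utf-8", errors="replace").decode("utf-8")


field = "caf\ud800e"  # hypothetical value containing a lone surrogate
clean = scrub_surrogates(field)
print(has_surrogates(field), has_surrogates(clean), clean)
```

Applied to a column before calling `.to_csv()`, a helper like this avoids the encode error at the cost of discarding the offending code points.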
Is this possible to do with the stdlib csv writer? |
This (plain open):

    import csv
    row = '\ud800'
    with open("test-you-can-delete.csv", "w") as _file:
        writer = csv.writer(_file)
        writer.writerow(row)

will yield the error below:

But

    import csv
    row = '\ud800'
    with open("test-you-can-delete.csv", "w", errors='surrogatepass') as _file:
        writer = csv.writer(_file)
        writer.writerow(row)

This doesn't generate an error. Implementing the named argument |
Makes sense - would accept a PR if you are up for it |
It's a busy time of year but I might get back to it later. In the meantime, if anyone else is interested in the problem, it is documented with a workaround and a link to a similar fix in |
I would like to point out that this approach can produce malformed UTF-8 so I'm not sure if that's a good path forward. For proof:
|
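The snippet from this comment did not survive on the page, so what follows is not the commenter's code. It is a small sketch of the concern, assuming the output is UTF-8: bytes written with `surrogatepass` cannot be decoded again by a strict UTF-8 decoder.

```python
lone = "\ud800"  # a lone high surrogate

# surrogatepass emits the surrogate as a 3-byte sequence that strict UTF-8 forbids.
raw = lone.encode("utf-8", errors="surrogatepass")

try:
    raw.decode("utf-8")  # strict decoding, as most consumers would do
    strict_decodes = True
except UnicodeDecodeError:
    strict_decodes = False

# Only a reader that also opts in to surrogatepass gets the data back.
round_trip = raw.decode("utf-8", errors="surrogatepass")
print(raw, strict_decodes, round_trip == lone)
```

So a CSV file produced this way can be read back by Python with the same error handler, but other tools may reject it.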
Of course, you should use the error handler that fits your need based on context: https://docs.python.org/3/library/codecs.html#error-handlers I could have used If I still had the original data, I would update my workaround above but I can't 100% confirm that using |
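To make the choice of handler more concrete, here is a rough comparison of how several standard handlers from the linked codecs documentation treat a lone surrogate when encoding to UTF-8. The helper `try_encode` and the sample string are illustrative only. Which handler is right depends on where the surrogate came from.

```python
def try_encode(text, handler):
    """Encode text as UTF-8 with the given error handler; None if it raises."""
    try:
        return text.encode("utf-8", errors=handler)
    except UnicodeEncodeError:
        return None


handlers = [
    "strict",            # raises on any surrogate
    "ignore",            # drops the surrogate
    "replace",           # substitutes '?'
    "backslashreplace",  # writes a literal \udXXX escape
    "surrogatepass",     # writes the surrogate's raw 3-byte form
    "surrogateescape",   # only handles U+DC80..U+DCFF, raises otherwise
]

sample = "a\ud800"  # hypothetical field with a lone high surrogate
results = {name: try_encode(sample, name) for name in handlers}
for name, encoded in results.items():
    print(f"{name:17} {encoded!r}")

# surrogateescape restores the original byte for surrogates it created on decode.
escaped = b"\xff".decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

Note that `surrogateescape` only round-trips surrogates that it produced itself while decoding undecodable bytes, so it does not help with an arbitrary surrogate such as `\ud800`.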
take |
Code Sample
Stack trace:
Problem description
The presence of Unicode surrogates in a dataframe (or Series) causes an error in .to_csv(). This has already been fixed in .to_hdf() by allowing the errors= argument to be used where we can use the surrogatepass or surrogateescape error handler. See the original bug report and the PR that fixed it.
Expected Output
No error.
Output of pd.show_versions()
I forgot to grab this before the end of my workshop, and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4, I think.