Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pull request #778 breaks streaming inputs #779

Closed
fgregg opened this issue Apr 28, 2024 · 8 comments · Fixed by #780
Closed

pull request #778 breaks streaming inputs #779

fgregg opened this issue Apr 28, 2024 · 8 comments · Fixed by #780

Comments

@fgregg
Copy link
Contributor

fgregg commented Apr 28, 2024

PR #778 introduced a seek that is not compatible with stdin and other inputs that do not provide seek.

@fgregg fgregg changed the title recent commit breaks streaming inputs pull request #778 breaks streaming inputs Apr 28, 2024
@fgregg fgregg closed this as completed Apr 28, 2024
@fgregg fgregg reopened this Apr 28, 2024
@jpmckinney
Copy link
Member

Did 1.10.1 fix it? Can you provide a way to reproduce?

@jpmckinney
Copy link
Member

Aha you edited - I’ll add a test.

@fgregg
Copy link
Contributor Author

fgregg commented Apr 28, 2024

if you are mucking about in there, i'm pretty sure that you will need a seek or other way of resetting the file in this block.

if sniff_limit is None:
# Reads to the end of the tile, but avoid reading the file twice.
handle = StringIO(f.read())
kwargs['dialect'] = csv.Sniffer().sniff(handle.getvalue())

@fgregg
Copy link
Contributor Author

fgregg commented Apr 28, 2024

not an agate test but, here's a csvkit repro.

echo "a" | csvsort

@jpmckinney
Copy link
Member

jpmckinney commented Apr 28, 2024

Yeah, It can read from stdin csvsort < examples/dummy.csv, but not from pipes. Not sure how to test pipes in pytest, so I'll add some sample commands to CI.

 if sniff_limit is None: 
     # Reads to the end of the tile, but avoid reading the file twice. 
     handle = StringIO(f.read()) 
     kwargs['dialect'] = csv.Sniffer().sniff(handle.getvalue())

We're passing handle.getvalue() which is an str. The position of the original StringIO doesn't move. Right?

getvalue() can be called consecutively with the same result.

@fgregg
Copy link
Contributor Author

fgregg commented Apr 28, 2024

got it. got confused with the overwriting of handle

@jpmckinney
Copy link
Member

jpmckinney commented Apr 28, 2024

I started #780

The docstring says sniff_limit reads bytes, which is true for stdin (now), but it actually reads characters for the original behavior. I don't know if it's worth forcing reading characters in both cases. I'm currently doing something like:

Would need something like:

I can also check if anyone wrote a peekable TextIOWrapper

I should also see what happens if sys.stdin.buffer is passed in. Edit: buffer is bytes and the old behavior would error on bytes. So, nothing to do there.

Edit: Add test for sniff limit that ends within UTF8 character. Edit2: sys.stdin.buffer.peek() always reads the entire input, up to 65536 bytes (on my machine at least, and in CPython at least). So, a test would be difficult, but I checked the logic with:

import io
io.BufferedReader(io.BytesIO('ʤ,ʤ,c\n1,2,3'.encode()), 4).peek().decode('utf-8')
# UnicodeDecodeError
io.BufferedReader(io.BytesIO('ʤ,ʤ,c\n1,2,3'.encode()), 4).peek().decode('utf-8', 'ignore')
# 'ʤ,'

@jpmckinney
Copy link
Member

Thank you for reporting! Should be fixed in 1.10.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants