Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN: memory-mapping code #44766

Closed
wants to merge 3 commits into from
Closed

CLN: memory-mapping code #44766

wants to merge 3 commits into from

Conversation

twoertwein
Copy link
Member

Rebased on #44761

No need for codecs.getincrementaldecoder as io.TextWrapperIO will do that (and we can use io.TextWrapperIO because mmap is wrapped inside _IOWrapper). io.TextWrapperIO also provides __next__ for us :)

Probably will need some benchmarking with utf-8/non-utf8 files.

pandas/io/common.py Outdated Show resolved Hide resolved
df = tm.makeDataFrame()
df.to_csv(path, mode="w+b")
tm.assert_frame_equal(df, pd.read_csv(path, index_col=0))
def test_binary_mode():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test and test_warning_missing_utf_bom had nothing to do with mmap but were in its test class.

@twoertwein twoertwein added Clean IO Data IO issues that don't fit into a more specific label labels Dec 5, 2021
pandas/io/common.py Outdated Show resolved Hide resolved

# add one entry with a sepcial character
encoding_ = encoding or "utf-8"
leonardo = "Léonardo".encode(encoding_, errors="ignore").decode(encoding_)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This more strict test version of the test would have failed on master with the python engine.

@@ -699,15 +699,15 @@ def test_encoding_mmap(memory_map):
GH 23254.
"""
encoding = "iso8859_1"
data = BytesIO(" 1 A Ä 2\n".encode(encoding))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test wasn't using memory_map because it silently failed.

@pep8speaks
Copy link

pep8speaks commented Dec 5, 2021

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-12-05 23:25:04 UTC

@twoertwein
Copy link
Member Author

twoertwein commented Dec 5, 2021

This PR should speed up all read_csv(..., engine="c", encoding="utf-8") calls when a binary handle is used (this should now be faster than using text handles), by applying the same shortcut as done for memory_map in #43787.

Using TextIOWrapper seems to be slower than the previous solution :( Will revert most of the changes in this PR.

@twoertwein twoertwein closed this Dec 6, 2021
@twoertwein twoertwein deleted the mmap branch April 1, 2022 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants