Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory efficient encoding detection #4112

Closed
decaz opened this issue Sep 27, 2019 · 9 comments
Closed

Memory efficient encoding detection #4112

decaz opened this issue Sep 27, 2019 · 9 comments
Labels
Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/

Comments

@decaz
Copy link
Contributor

decaz commented Sep 27, 2019

At #2549 @asvetlov proposed useful improvement of detection of content encoding to prevent reading the whole response in memory by allowing to specify max data size for sniffing.

UniversalDetector can be used: https://chardet.readthedocs.io/en/latest/usage.html#example-detecting-encoding-incrementally.

@asvetlov
Copy link
Member

Yes, sure.
Would you make a PR?

@decaz
Copy link
Contributor Author

decaz commented Sep 27, 2019

Unfortunately right now I don't have time to, maybe later. I've created issue just not to forget about it.

@asvetlov asvetlov added the Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/ label Oct 3, 2019
@icmpecho
Copy link
Contributor

icmpecho commented Oct 9, 2019

@asvetlov @decaz I was taking a quick look into this and have some open questions. As far as I see the read method of ClientResponse has already read the entire body and put it into the memory. the get_encoding method just make use of that in-memory body right?
So what is your expectation of this PR? Is it about changing the behaviour of read so it return a stream or just splitting the already in-memory body into chunks and feed that into the UniversalDetactor ?

@asvetlov
Copy link
Member

asvetlov commented Oct 9, 2019

We can replace 'resp.read()withresp.content.read(2048)for fetching first 2 Kb buffer (I have no idea what is the best size of sniffing uncoding though).content.read()is not stored inresp._body`, so we need more complex buffering strategy.

@Transfusion
Copy link
Contributor

We can replace 'resp.read()withresp.content.read(2048)for fetching first 2 Kb buffer (I have no idea what is the best size of sniffing uncoding though).content.read()is not stored inresp._body`

At the time of @decaz 's original PR, await self.read() is always called if the encoding is not detected, i.e. there is no charset in the Content-Type response header from the web server, so _content (aka _body) will always be populated when chardet.detect is called,

However, as of now, get_encoding() always assumes _body is not None, i.e. read(), json(), or text() must be called first before it will work, or else it will throw an error if charset isn't in the Content-Type response header.

def get_encoding(self) -> str:
ctype = self.headers.get(hdrs.CONTENT_TYPE, '').lower()
mimetype = helpers.parse_mimetype(ctype)
encoding = mimetype.parameters.get('charset')
if encoding:
try:
codecs.lookup(encoding)
except LookupError:
encoding = None
if not encoding:
if mimetype.type == 'application' and mimetype.subtype == 'json':
# RFC 7159 states that the default encoding is UTF-8.
encoding = 'utf-8'
else:
encoding = chardet.detect(self._body)['encoding']
if not encoding:
encoding = 'utf-8'

so we need more complex buffering strategy.

  • If ClientResponse.read() has already been called then we can just do detection on the whole body, no issue.
  • If it hasn't (_body is None)...
    • Repeatedly call resp.content.read(n=2048) and append to a self._partial_body = bytearray() in ClientResponse until charset is detected or we reach EOF?
    • Let's say detection is successful after i read 4096 bytes; the 2nd call. Now I want to read the full response, so I call text(). Inside read(), is all we need to do self._body = self._partial_body.append(self.content.read()) ? Because as the name of StreamReader suggests, read() will continue where we left off?

Is this the strategy you are thinking of?

@asvetlov
Copy link
Member

Sorry for misleading in my previous message.
The change would be the following:

  1. Read the entire body as we do it right now.
  2. Pass only first 4KiB to chardet. Memoryview can be used: encoding = chardet.detect(memoryview(self._body)[:4096])['encoding']
    Did I miss something?

P.S.
As I see get_encoding() should be a private method actually but this is another issue.

@Transfusion
Copy link
Contributor

Memoryview cannot be passed into cchardet (the optimized cython version).
File "/home/transfusion/python38/lib/python3.8/site-packages/cchardet/__init__.py", line 35, in feed self._detector.feed(data) TypeError: Argument 'msg' has incorrect type (expected bytes, got memoryview)

I had to use https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes , which has the same effect as directly slicing the bytes object since it copies the data.

I have done a quick test with memory_profiler by passing some webpages incrementally into UniversalDetector(), with cchardet 2.1.4 and chardet 3.0.4.

from memory_profiler import profile
try:
    import cchardet as chardet
except ImportError:
    import chardet
import urllib.request
global _bytes

@profile
def chunked_with_memoryview(chunk_size: int = 2 * 1024):
    detector = chardet.UniversalDetector()
    _body_memoryview = memoryview(_bytes)
    # print(len(_bytes) == len(_body_memoryview))
    for i in range((len(_body_memoryview) // chunk_size) + 1):
        _chunk = _body_memoryview[i * chunk_size: (i + 1) * chunk_size]
        _chunk_bytes = _chunk.tobytes()
        detector.feed(_chunk_bytes)
        del _chunk_bytes
        if detector.done:
            print("chunk " + str(i) + " reached")
            break

    detector.close()
    print(detector.result)


@profile
def without_chunking():
    print(chardet.detect(_bytes)['encoding'])


if __name__ == "__main__":
    global _bytes

    # SHIFT-JIS, GB18030, KOI8-R, UHC/EUC-KR
    for website in ['https://www.jalan.net/', 'http://ctrip.com',
                    'http://nn.ru', 'https://www.incruit.com/']:
        response = urllib.request.urlopen(website)
        _bytes = response.read()

        chunked_with_memoryview()
        without_chunking()

with cchardet: https://gist.github.com/Transfusion/b2075b7a08863c3e5b5afc96b119c29d
with chardet: https://gist.github.com/Transfusion/be47f38522f0a22e6dce6af95243082b

I neither see any significant decrease nor increase in memory usage with cchardet.
There is some small improvement (< 2 MB) if cchardet does not exist and instead falls back to chardet (which btw can work with memoryview). However, the same improvement can be had by simply installing cchardet, unless for some reason the platform doesn't support it.

@asvetlov
Copy link
Member

I'm curious what is the performance improvement if any?

@Dreamsorcerer
Copy link
Member

The recommended libraries for charset encoding seem to outperform these libraries, plus charset detection seems to only be used when we already have the full body, so I don't think there's any value in looking at this any further.

@Dreamsorcerer Dreamsorcerer closed this as not planned Won't fix, can't repro, duplicate, stale Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hacktoberfest We think it's good for https://hacktoberfest.digitalocean.com/
Projects
None yet
Development

No branches or pull requests

5 participants