Skip to content
This repository has been archived by the owner on Jan 13, 2021. It is now read-only.

Fixed to read out all data for buffering in BufferedSocket when Data Frame payload is quite large #80

Merged
merged 1 commit into from
Jan 28, 2015

Conversation

t2y
Copy link
Contributor

@t2y t2y commented Jan 25, 2015

Could you confirm whether accessing www.google.co.uk would occur as below?

My environment is OS X Yosemite 10.10.1 (x86_64) and use MacPorts to install Python 3.4.2.

(hyper)$ python
Python 3.4.2 (default, Nov 23 2014, 23:10:23) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from hyper import HTTP20Connection
>>> conn = HTTP20Connection('www.google.co.uk')
>>> conn.request('GET', '/')
1
>>> response = conn.getresponse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../hyper/http20/connection.py", line 171, in getresponse
    return HTTP20Response(stream.getheaders(), stream)
  File ".../hyper/http20/stream.py", line 293, in getheaders
    self._recv_cb()
  File ".../hyper/http20/connection.py", line 541, in _recv_cb
    self._consume_single_frame()
  File ".../hyper/http20/connection.py", line 483, in _consume_single_frame
    frame, length = Frame.parse_frame_header(header)
  File ".../hyper/http20/frame.py", line 52, in parse_frame_header
    frame = FRAMES[type](stream_id)
KeyError: 97

The buffering issue in BufferedSocket causes this error. In my environment self._sck.recv_into(_buffer) read data only 1135 bytes in spite of _buffer size is 65535 bytes when target length (amt) is 16384 bytes. I'm not sure whether socket.recv() would ensure to read the buffer given length.

$ man 2 recv
     ssize_t
     recv(int socket, void *buffer, size_t length, int flags);

So, I changed to read data from socket until count == 0 (it means no data) or self._bytes_in_buffer == amt.

@Lukasa
Copy link
Member

Lukasa commented Jan 25, 2015

So this is a tricky issue.

recv_into is guaranteed to return as much data as it can into the size of the buffer. What it does not do is block the read until it can fill the buffer. This means that, when you call recv_into the socket layer will copy as much data out of its buffer as it can into the buffer provided.

In the case in question, we have room in the buffer and amt is large, but smaller than the room in the buffer. In principle, therefore, recv_into could copy amt bytes. However, in this specific case the OS layer hasn't received all those bytes yet: it's only received 1135 of them. The rest are still in transit from Google to you.

Because of the way recv_into is written, rather than blocking until all those bytes are received it returns immediately with what it has.

You're correct that the BufferedSocket implementation currently doesn't account for this. It should, but doing so requires that we think very carefully. That's because the BufferedSocket has a very notable extra property, which is that it allows for 'optimistic readahead'.

That's implemented in hyper.http20.connection.HTTP20Connection._recv_cb. This confirms that a socket has data available to read, and if it does will pop frames off the connection. It does this to make sure that we see control frames early, but it must not block. If it blocks here, we cause users to have a severe and negative experience, as their request blocks behind pumping other incomplete frames off the connection.

This change as-is would remove that function, because any frame being read must be read it its entirety. What we need, I think, is a non_blocking flag on BufferedSocket.recv() that will circumvent this loop: essentially, if the socket doesn't have data to read and there isn't enough data in the buffer, return nothing. Does that sound like an extension you could make?

@t2y
Copy link
Contributor Author

t2y commented Jan 25, 2015

Thank you for complaisant explanation!

I have learned the socket behavior about what was happening. I understand that improving the loop function in hyper.http20.connection.HTTP20Connection._recv_cb is better. Then, I agree with BufferedSocket must not block.

I'll investigate more and re-implement this solution.

@Lukasa
Copy link
Member

Lukasa commented Jan 25, 2015

Thanks @t2y! Your work here is extremely valued. =) 🍰

@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

I re-implemented without changing BufferedSocket behavior. However, there's something wrong in testing?

@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

ah, maybe it would need coverage 100. I'll try.

$ coverage report -m --fail-under 100

@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

Before I write some test code, is my changes what you expect?

@Lukasa
Copy link
Member

Lukasa commented Jan 27, 2015

@t2y It's closer! I think a few small changes and we should get there.

Three notes. First, the count variable in _recv_cb is important to ensure that a single stream doesn't get locked into a tight loop of consuming frames. We should keep it. =)

The second problem is that this change allows a DoS attack if the server emits part of a frame and blocks on the rest. That's probably OK in an early version, but we should note it. You don't need to make a change here, just add a comment to mention the DoS risk.

The final problem is that we do a lot of copying of data in this line:

data = memoryview(data.tobytes() + _data.tobytes())

I think at this point we may want to refactor the code a bit to avoid the copies. I suggest using an array.array() that is length long, and then having a single memoryview into which we read the data. That should minimise the copies. It's obviously not perfect, but it's good enough for now.

Let me know if it's not clear what that last point means and I'll write a bit of sample code to show what I mean. =)

Again, @t2y, thanks for this! You're doing fantastic work!

@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

First, the count variable in _recv_cb is important to ensure that a single stream doesn't get locked into a tight loop of consuming frames. We should keep it. =)

oops! I'll revert the changes.

The second problem

I'd like you to write the comment after I finished to fix. Because, I'm not native English writer, so please 😓

The final problem is that we do a lot of copying of data in this line:

Yes. I think so, too. I'll improve it.

@Lukasa
Copy link
Member

Lukasa commented Jan 27, 2015

@t2y I'm happy to write the comment for you. =)

@t2y t2y force-pushed the development branch 2 times, most recently from 7a79c61 to 8d83109 Compare January 27, 2015 11:44
@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

I suggest using an array.array() that is length long, and then having a single memoryview into which we read the data.

According to https://docs.python.org/3.4/library/array.html, it seems that array module cannot handle bytes data. So, I make bytearray with memoryview and handle it. I guess it's better than previous one.

@Lukasa
Copy link
Member

Lukasa commented Jan 27, 2015

Alright, now we're really getting close! This would be merge-able as-is, but I'd like us to try to simplify the code a bit now, so that I can remember what it's doing when I come back in a few months.

I think that, rather than having an if followed by a while, we should just use a while. This means that you need to initialize the buffer and buffer_view ahead of loop. Then, the case where length == 0 becomes a trivial sub-case of the main loop. =)

The second thing is that we should avoid the use of .to_bytes() when copying from one memoryview to another. Instead, you should use the [:] construct to perform the copy. For example, this:

_buffer_view[_end:_end + data_length] = _data.tobytes()

becomes this:

_buffer_view[_end:_end + data_length] = _data[:]

This avoids creating an intermediate bytes object that needs to immediately be garbage collected. In principle, this function shouldn't ever need to call .to_bytes(), because we're always working with memoryviews.

Again, please let me know if any of that was unclear. =)

@t2y
Copy link
Contributor Author

t2y commented Jan 27, 2015

I think that, rather than having an if followed by a while, we should just use a while.

You mean like this?

    def _recv_payload(self, length):                                                
        if not length:                                                              
            return memoryview(b'')                                                  

        buffer = bytearray(length)                                                  
        buffer_view = memoryview(buffer)                                            
        index = 0
        # _sock.recv(length) might not read out all data if the given length is
        # very large. So it should be to retrieve from socket repeatedly.
        while True:                                                                 
            data = self._sock.recv(length)                                          
            data_length = len(data)                                                 
            end = index + data_length                                               
            buffer_view[index:end] = data[:]                                        
            length -= data_length                                                   
            if length == 0 or data_length == 0:
                break                                                               
            index = end                                                             

        return buffer_view[:end]

copying from one memoryview to another. Instead, you should use the [:]

Oh, sounds good. I learned an efficient idiom.

@Lukasa
Copy link
Member

Lukasa commented Jan 28, 2015

Oh, sounds good. I learned an efficient idiom.

Glad to hear it! memoryviews are a fun part of Python that most people don't seem to know about, so this will be a useful skill for you.

As for your loop, I think we can go one better. Rather than using while True with an if...break at the bottom, you can hoist the if test to be the while test (inverting the logic). Thus, your

while True:
    ...
    if length == 0 or data_length == 0:
        break

becomes

while length and data_length:
    ...

@t2y
Copy link
Contributor Author

t2y commented Jan 28, 2015

memoryviews are a fun part of Python that most people don't seem to know about, so this will be a useful skill for you.

Exactly, that excites me.

you can hoist the if test to be the while test (inverting the logic). Thus, your

Absolutely, I didn't notice at that time. 😓 You give me awareness and it is helpful. Try again.

    def _recv_payload(self, length):                                                
        if not length:                                                              
            return memoryview(b'')                                                  

        buffer = bytearray(length)                                                  
        buffer_view = memoryview(buffer)                                            
        index = 0                                                                   
        data_length = -1                                                            
        # _sock.recv(length) might not read out all data if the given length        
        # is very large. So it should be to retrieve from socket repeatedly.        
        while length and data_length:                                               
            data = self._sock.recv(length)                                          
            data_length = len(data)                                                 
            end = index + data_length                                               
            buffer_view[index:end] = data[:]                                        
            length -= data_length                                                   
            index = end                                                             

        return buffer_view[:end]   

@Lukasa
Copy link
Member

Lukasa commented Jan 28, 2015

@t2y Hurrah! This looks great! 🍰 Thank you so much.

Lukasa added a commit that referenced this pull request Jan 28, 2015
Fixed to read out all data for buffering in BufferedSocket when Data Frame payload is quite large
@Lukasa Lukasa merged commit 82a2f33 into python-hyper:development Jan 28, 2015
Lukasa added a commit that referenced this pull request Jan 28, 2015
@Lukasa
Copy link
Member

Lukasa commented Jan 28, 2015

FYI, you've earned your contributors spot here. =D 🍰 ⭐

@t2y
Copy link
Contributor Author

t2y commented Jan 28, 2015

I'm learning http2 with pleasure. Thank you, too! 😄

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants