Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Speyedr · 2024-07-21T10:21:54Z

A simple workaround for a bug with a complicated chain of events, due an oversight in the usage of the `Pypdl._kwargs` attribute.

This PR fixes #23.

When creating a Pypdl object, one has the ability to pass in arbitrary headers which will then be forwarded later to an aiohttp.ClientSession() object. These headers are collected within **kwargs and then stored in the self._kwargs attribute:

pypdl/pypdl/pypdl_manager.py

Lines 44 to 48 in 12a20ad

    
           self._kwargs = { 
        
               "timeout": aiohttp.ClientTimeout(sock_read=60), 
        
               "raise_for_status": True, 
        
           } 
        
           self._kwargs.update(kwargs)

This dictionary is passed to multiple objects during normal procedure, such as during the Pypdl._get_header() method...

pypdl/pypdl/pypdl_manager.py

Lines 247 to 252 in 12a20ad

    
           async def _get_header(self, url): 
        
               async with aiohttp.ClientSession() as session: 
        
                   async with session.head(url, **self._kwargs) as response: 
        
                       if response.status == 200: 
        
                           self.logger.debug("Header accquired from head request") 
        
                           return response.headers

...and the Pypdl._multi_segment() method:

pypdl/pypdl/pypdl_manager.py

Lines 259 to 270 in 12a20ad

    
           async def _multi_segment(self, segments, segment_table): 
        
               tasks = [] 
        
               self.logger.debug("Multi-Segment download started") 
        
               async with aiohttp.ClientSession() as session: 
        
                   for segment in range(segments): 
        
                       md = Multidown(self._interrupt) 
        
                       self._workers.append(md) 
        
                       tasks.append( 
        
                           asyncio.create_task( 
        
                               md.worker(segment_table, segment, session, **self._kwargs) 
        
                           ) 
        
                       )

In order to request ranges of bytes to perform multi-segment downloads, Multidown.worker() collects the user-supplied arguments in **kwargs and then later adds a range header, as this needs to be passed later to another aiohttp.ClientSession() object:

pypdl/pypdl/downloader.py

Lines 46 to 48 in 12a20ad

    
           async def worker( 
        
               self, segment_table: dict, id: int, session: ClientSession, **kwargs 
        
           ) -> None:

Line 65 updates the kwargs['headers'] dictionary.

pypdl/pypdl/downloader.py

Lines 63 to 66 in 12a20ad

    
           if self.curr < size: 
        
               start = start + self.curr 
        
               kwargs.setdefault("headers", {}).update({"range": f"bytes={start}-{end}"}) 
        
               await self.download(url, segment_path, "ab", session, **kwargs)

This has an unintended side-effect that:

If a user has passed custom headers through Pypdl(headers=headers), then Line 65 permanently alters Pypdl._kwargs due to Python passing Pypdl._kwargs by reference instead of by value.
When a subsequent request is made later by Pypdl._get_header() (see Line 249 above), the range header is still present and will be included in a request that is not meant to include a range header.
Pypdl._get_header() only expects to receive status code 200 in response, however the MDN web docs suggest that servers should respond with 206 if a range header is supplied.
If a server does respond with 206 then the Pypdl._get_header() method falls off and returns None. This has dire consequences when used by Pypdl._get_info(), as it sets the local attribute header to None...

pypdl/pypdl/pypdl_manager.py

Lines 228 to 230 in 12a20ad

    
           def _get_info(self, url, file_path, multisegment, etag): 
        
               header = asyncio.run(self._get_header(url)) 
        
               file_path = get_filepath(url, header, file_path)

...which when passed to get_filepath() will result in a None reference error.

pypdl/pypdl/utls.py

Lines 31 to 32 in 12a20ad

    
           def get_filepath(url: str, headers: Dict, file_path: str) -> str: 
        
               content_disposition = headers.get("Content-Disposition", None)

Stacktrace:

(Pypdl)  21-07-24 15:21:29 - ERROR: (AttributeError) ['NoneType' object has no attribute 'get']
Traceback (most recent call last):
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 103, in download
    result = self._execute(
             ^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 175, in _execute
    file_path, multisegment, etag = self._get_info(
                                    ^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 230, in _get_info
    file_path = get_filepath(url, header, file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\utls.py", line 32, in get_filepath
    content_disposition = headers.get("Content-Disposition", None)
                          ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

This bug has likely gone unsighted because it requires:

A user downloading multiple files using the same Pypdl object
The user passing in custom headers
The server supporting multi-segment downloads
The server returning 206 because the range header was included, instead of ignoring it and simply returning 200.

This PR includes a simple addition to the Multidown.worker() method that creates a shallow copy of kwargs['headers'] if it already exists, which disconnects it from the original reference and allows the range header to be added without affecting Pypdl._kwargs['header'].

This fix is faster than alternative patches which may require the use of deepcopy to properly disconnect or reset the Pypdl._kwargs attribute, however as a consequence this PR does not prevent the same issue from occurring again (i.e. unintentionally modifying attributes of Pypdl that have been passed by reference to callee functions).

…eader-fix' into multidown-prevent-sticky-range-header-fix # Conflicts: # pypdl/downloader.py

Speyedr · 2024-07-21T10:35:44Z

Fixed anti-pattern issue, might want to squash commits.

Speyedr added 3 commits July 21, 2024 14:46

Prevent "sticky range header" bug.

e48f65b

Prevent "sticky range header" bug.

6b308cf

Merge remote-tracking branch 'origin/multidown-prevent-sticky-range-h…

282b789

…eader-fix' into multidown-prevent-sticky-range-header-fix # Conflicts: # pypdl/downloader.py

mjishnu merged commit 1327b7c into mjishnu:main Jul 21, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Speyedr commented Jul 21, 2024

Speyedr commented Jul 21, 2024

	self._kwargs = {
	"timeout": aiohttp.ClientTimeout(sock_read=60),
	"raise_for_status": True,
	}
	self._kwargs.update(kwargs)

	async def _get_header(self, url):
	async with aiohttp.ClientSession() as session:
	async with session.head(url, **self._kwargs) as response:
	if response.status == 200:
	self.logger.debug("Header accquired from head request")
	return response.headers

	async def _multi_segment(self, segments, segment_table):
	tasks = []
	self.logger.debug("Multi-Segment download started")
	async with aiohttp.ClientSession() as session:
	for segment in range(segments):
	md = Multidown(self._interrupt)
	self._workers.append(md)
	tasks.append(
	asyncio.create_task(
	md.worker(segment_table, segment, session, **self._kwargs)
	)
	)

	async def worker(
	self, segment_table: dict, id: int, session: ClientSession, **kwargs
	) -> None:

	if self.curr < size:
	start = start + self.curr
	kwargs.setdefault("headers", {}).update({"range": f"bytes={start}-{end}"})
	await self.download(url, segment_path, "ab", session, **kwargs)

	def _get_info(self, url, file_path, multisegment, etag):
	header = asyncio.run(self._get_header(url))
	file_path = get_filepath(url, header, file_path)

	def get_filepath(url: str, headers: Dict, file_path: str) -> str:
	content_disposition = headers.get("Content-Disposition", None)

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Conversation

Speyedr commented Jul 21, 2024

A simple workaround for a bug with a complicated chain of events, due an oversight in the usage of the Pypdl._kwargs attribute.

Stacktrace:

Speyedr commented Jul 21, 2024

A simple workaround for a bug with a complicated chain of events, due an oversight in the usage of the `Pypdl._kwargs` attribute.