Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent sticky Pypdl._kwargs['headers']['range'] reference bug when reusing Pypdl object with user-supplied headers and multiple multi-segment downloads #24

Merged

Conversation

Speyedr
Copy link
Contributor

@Speyedr Speyedr commented Jul 21, 2024

A simple workaround for a bug with a complicated chain of events, due an oversight in the usage of the Pypdl._kwargs attribute.

This PR fixes #23.

When creating a Pypdl object, one has the ability to pass in arbitrary headers which will then be forwarded later to an aiohttp.ClientSession() object. These headers are collected within **kwargs and then stored in the self._kwargs attribute:

self._kwargs = {
"timeout": aiohttp.ClientTimeout(sock_read=60),
"raise_for_status": True,
}
self._kwargs.update(kwargs)

This dictionary is passed to multiple objects during normal procedure, such as during the Pypdl._get_header() method...

async def _get_header(self, url):
async with aiohttp.ClientSession() as session:
async with session.head(url, **self._kwargs) as response:
if response.status == 200:
self.logger.debug("Header accquired from head request")
return response.headers

...and the Pypdl._multi_segment() method:

async def _multi_segment(self, segments, segment_table):
tasks = []
self.logger.debug("Multi-Segment download started")
async with aiohttp.ClientSession() as session:
for segment in range(segments):
md = Multidown(self._interrupt)
self._workers.append(md)
tasks.append(
asyncio.create_task(
md.worker(segment_table, segment, session, **self._kwargs)
)
)

In order to request ranges of bytes to perform multi-segment downloads, Multidown.worker() collects the user-supplied arguments in **kwargs and then later adds a range header, as this needs to be passed later to another aiohttp.ClientSession() object:

async def worker(
self, segment_table: dict, id: int, session: ClientSession, **kwargs
) -> None:

Line 65 updates the kwargs['headers'] dictionary.
if self.curr < size:
start = start + self.curr
kwargs.setdefault("headers", {}).update({"range": f"bytes={start}-{end}"})
await self.download(url, segment_path, "ab", session, **kwargs)

This has an unintended side-effect that:

  • If a user has passed custom headers through Pypdl(headers=headers), then Line 65 permanently alters Pypdl._kwargs due to Python passing Pypdl._kwargs by reference instead of by value.
  • When a subsequent request is made later by Pypdl._get_header() (see Line 249 above), the range header is still present and will be included in a request that is not meant to include a range header.
  • Pypdl._get_header() only expects to receive status code 200 in response, however the MDN web docs suggest that servers should respond with 206 if a range header is supplied.
  • If a server does respond with 206 then the Pypdl._get_header() method falls off and returns None. This has dire consequences when used by Pypdl._get_info(), as it sets the local attribute header to None...

def _get_info(self, url, file_path, multisegment, etag):
header = asyncio.run(self._get_header(url))
file_path = get_filepath(url, header, file_path)

  • ...which when passed to get_filepath() will result in a None reference error.

pypdl/pypdl/utls.py

Lines 31 to 32 in 12a20ad

def get_filepath(url: str, headers: Dict, file_path: str) -> str:
content_disposition = headers.get("Content-Disposition", None)

Stacktrace:

(Pypdl)  21-07-24 15:21:29 - ERROR: (AttributeError) ['NoneType' object has no attribute 'get']
Traceback (most recent call last):
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 103, in download
    result = self._execute(
             ^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 175, in _execute
    file_path, multisegment, etag = self._get_info(
                                    ^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\pypdl_manager.py", line 230, in _get_info
    file_path = get_filepath(url, header, file_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Speyedr\Documents\PyCharm Projects\pypdl-sticky-header-investigation\pypdl\utls.py", line 32, in get_filepath
    content_disposition = headers.get("Content-Disposition", None)
                          ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'get'

This bug has likely gone unsighted because it requires:

  • A user downloading multiple files using the same Pypdl object
  • The user passing in custom headers
  • The server supporting multi-segment downloads
  • The server returning 206 because the range header was included, instead of ignoring it and simply returning 200.

This PR includes a simple addition to the Multidown.worker() method that creates a shallow copy of kwargs['headers'] if it already exists, which disconnects it from the original reference and allows the range header to be added without affecting Pypdl._kwargs['header'].

This fix is faster than alternative patches which may require the use of deepcopy to properly disconnect or reset the Pypdl._kwargs attribute, however as a consequence this PR does not prevent the same issue from occurring again (i.e. unintentionally modifying attributes of Pypdl that have been passed by reference to callee functions).

@Speyedr
Copy link
Contributor Author

Speyedr commented Jul 21, 2024

Fixed anti-pattern issue, might want to squash commits.

@mjishnu mjishnu merged commit 1327b7c into mjishnu:main Jul 21, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants