Support for dynamic url #25

alepodj · 2024-07-31T19:15:35Z

Hi, first of all, awesome library. It makes downloading so much simpler.

Quick question: is there any way to use Pypdl with ThreadPoolExecutor for concurrency?

I see the provided option of passing a list of predefined tasks/links to PypdlFactory. But what if that information is dynamic, so I can't have a static list of URLs beforehand? I'm fetching the URLs I need four at a time using ThreadPoolExecutor. The function fetches the four download URLs concurrently, and then I use Pypdl to download all four files at the same time. It's kind of working, but in the output the multiple downloads flash one on top of the other every second or less.

It's not a major thing; I was just curious whether there is a way to make it work more neatly with ThreadPoolExecutor.

Cheers again for the awesome library :)

import json
from concurrent.futures import ThreadPoolExecutor

from pypdl import Pypdl
from seleniumbase import Driver

# `links`, `path`, and `check_connection` are not shown in this snippet;
# they are assumed to be defined elsewhere in the script.


def get_download_streams(url):
    driver = Driver(uc=True, log_cdp_events=True, devtools=True)
    driver.get(url)

    # Scan the browser's performance log for the MP4 stream response.
    logs = driver.get_log('performance')

    for log in logs:
        log = json.loads(log['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'video/mp4':
                stream_url = log['message']['params']['response']['url']

                # Each worker thread creates and starts its own Pypdl instance.
                dl = Pypdl()
                dl.start(
                    url=stream_url,
                    file_path=path,
                    segments=4,
                    retries=3,
                    mirror_func=check_connection,
                )


# Resolve and download up to four links concurrently.
with ThreadPoolExecutor(max_workers=4) as executor:
    for link in links:
        executor.submit(get_download_streams, link)


mjishnu · 2024-08-01T06:16:17Z

Hello. The issue you are experiencing happens because multiple Pypdl instances are writing to the console at the same time and overwriting each other's progress bars. The downloads themselves should complete properly as long as you provide a different path for each download. Pypdl should work fine with ThreadPoolExecutor: under the hood Pypdl also uses ThreadPoolExecutor, so here you are just wrapping around it. Setting block=False should therefore have the same effect as using ThreadPoolExecutor, and it would be more efficient because it avoids the unnecessary double wrapping.

To fix the progress bar issue, you could disable the progress bar of each Pypdl instance by setting display=False and build a custom progress bar from the combined values of the progress attribute of every Pypdl instance, similar to how it is done in PypdlFactory. That is going to be pretty tedious, so a better approach is to adapt the code to use PypdlFactory. I have made a few changes so that you can now pass functions to the url parameter; you can try this in the test version. Here is an example of a similar scenario. There, the mirror_func parameter takes the dynamic URL, but it should work similarly for the url parameter as well.

import json
from seleniumbase import Driver
from pypdl import PypdlFactory


def get_download_streams(url):
    driver = Driver(uc=True, log_cdp_events=True, devtools=True)
    driver.get(url)

    logs = driver.get_log("performance")

    for log in logs:
        log = json.loads(log["message"])
        if log["message"]["method"] == "Network.responseReceived":
            if log["message"]["params"]["response"]["mimeType"] == "video/mp4":
                stream_url = log["message"]["params"]["response"]["url"]
                return stream_url


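tasks = []
# Files will be saved to a folder called downloads (assuming it is already created).
# `links` is assumed to be defined elsewhere.
for link in links:
    # Bind `link` as a default argument so each lambda keeps its own URL;
    # a plain `lambda: get_download_streams(link)` looks up `link` only when
    # it is called, so every task would resolve the last link in the loop.
    tasks.append((lambda link=link: get_download_streams(link), {"file_path": "downloads/"}))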

# create a factory with 4 workers
factory = PypdlFactory(4)
x = factory.start(tasks)
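
For reference, the manual alternative mentioned above (turning off each instance's own progress bar and drawing one combined bar) could be sketched roughly as follows. This is only a sketch under assumptions: it expects each downloader object to expose a `progress` attribute holding a percentage from 0 to 100 and a `completed` flag, and it treats an optional `failed` flag as "finished". The helper names `combined_progress`, `render_bar`, and `watch` are made up for illustration and are not part of Pypdl.

```python
import time


def combined_progress(downloaders):
    """Return the mean of the `progress` percentages, or 0.0 for no downloads."""
    items = list(downloaders)
    if not items:
        return 0.0
    return sum(d.progress for d in items) / len(items)


def render_bar(percent, width=30):
    """Render a text progress bar for a percentage between 0 and 100."""
    filled = int(width * percent / 100)
    return "[" + "#" * filled + "-" * (width - filled) + f"] {percent:5.1f}%"


def is_finished(downloader):
    # `failed` is treated as optional, so a failed download cannot keep
    # the watch loop running forever.
    return downloader.completed or getattr(downloader, "failed", False)


def watch(downloaders, interval=0.5):
    """Redraw a single combined bar until every download has finished."""
    items = list(downloaders)
    while not all(is_finished(d) for d in items):
        print("\r" + render_bar(combined_progress(items)), end="", flush=True)
        time.sleep(interval)
    # Draw the final state and move to a new line.
    print("\r" + render_bar(combined_progress(items)))
```

With Pypdl, the idea would be to start every instance with display=False and block=False, keep the instances in a list, and pass that list to `watch`. Note that a plain mean of percentages ignores file sizes, so a small file counts as much as a large one.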

alepodj · 2024-08-01T16:33:20Z

Thanks for the detailed response

mjishnu · 2024-08-03T13:19:38Z

Did it fix the issue?

alepodj · 2024-08-04T01:16:03Z

yeah all good, thanks again

mjishnu changed the title ~~Any way to use Pypdl with ThreadPoolExecutor~~ Support for dynamic url Aug 4, 2024

mjishnu closed this as completed Aug 4, 2024
