Support for dynamic url #25

Closed
alepodj opened this issue Jul 31, 2024 · 4 comments
Comments

alepodj commented Jul 31, 2024

Hi, first of all, awesome library; it makes downloading so much simpler.

Quick question: is there any way to use Pypdl with ThreadPoolExecutor for concurrency?

I see the provided option of passing a list of predefined tasks/links to PypdlFactory. But what if that information is dynamic, so I cannot have a static list of URLs beforehand? I'm fetching the URLs I need 4 at a time using ThreadPoolExecutor. The function fetches the 4 download URLs concurrently, and then I'm using Pypdl to download all 4 files at the same time. It's kind of working, but the multiple downloads flash one on top of the other every second or less in the output.

It's not a major thing; I was just curious whether there is a way to make it work more "cute" with ThreadPoolExecutor.

Cheers again for the awesome library :)

[Image: 2downloads]

import json
from concurrent.futures import ThreadPoolExecutor

from pypdl import Pypdl
from seleniumbase import Driver

# `links`, `path` and `check_connection` are not shown in this snippet.


def get_download_streams(url):
    # Load the page and collect the browser's performance (CDP) logs.
    driver = Driver(uc=True, log_cdp_events=True, devtools=True)
    driver.get(url)

    logs = driver.get_log('performance')

    for log in logs:
        log = json.loads(log['message'])
        if log['message']['method'] == 'Network.responseReceived':
            if log['message']['params']['response']['mimeType'] == 'video/mp4':
                stream_url = log['message']['params']['response']['url']

                # Download each video/mp4 response found in the log.
                dl = Pypdl()
                dl.start(
                    url=stream_url,
                    file_path=path,
                    segments=4,
                    retries=3,
                    mirror_func=check_connection,
                )


# Resolve and download up to 4 links at a time.
with ThreadPoolExecutor(max_workers=4) as executor:
    for link in links:
        executor.submit(get_download_streams, link)

mjishnu (Owner) commented Aug 1, 2024

Hello. The issue you are experiencing happens because multiple Pypdl instances are writing to the console at the same time and overwriting each other's progress bars. The downloads themselves should complete properly, as long as you provide a different path for each one. Pypdl works fine with ThreadPoolExecutor, but it already uses ThreadPoolExecutor under the hood, so you are only wrapping one around the other. Setting block=False should have the same effect as using ThreadPoolExecutor and is more efficient, since it avoids the unnecessary double wrapping.
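
The overwriting can be reproduced without Pypdl. The following is only a rough sketch: `FakeDownload` is a hypothetical stand-in for any downloader that exposes a 0-100 `progress` value, and `visible_line` approximates a terminal by keeping just the text after the last carriage return. It contrasts every downloader redrawing its own line with a single writer that prints one combined value.

```python
import io


class FakeDownload:
    """Hypothetical stand-in for a downloader exposing a 0-100 `progress` value."""

    def __init__(self, name, progress=0):
        self.name = name
        self.progress = progress


def render_own_bar(download, stream):
    # Each downloader redraws "its" line by jumping back to the line start.
    stream.write(f"\r{download.name}: {download.progress}%")


def render_combined(downloads, stream):
    # A single writer: average the individual values and redraw one line.
    overall = sum(d.progress for d in downloads) / len(downloads)
    stream.write(f"\rtotal: {overall:.0f}% ({len(downloads)} files)")


def visible_line(stream):
    # Rough approximation of a terminal: only the text after the last
    # carriage return remains on screen.
    return stream.getvalue().rsplit("\r", 1)[-1]


downloads = [FakeDownload("a.mp4", 80), FakeDownload("b.mp4", 20)]

clobbered = io.StringIO()
for d in downloads:
    render_own_bar(d, clobbered)

combined = io.StringIO()
render_combined(downloads, combined)

print(visible_line(clobbered))  # only the last writer's text survives
print(visible_line(combined))   # one line summarising both downloads
```

With real downloads the values change continuously, so the per-instance version shows whichever bar was written last, which matches the flashing described in the question.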

To fix the progress bar, you could disable the bar of each Pypdl instance by setting display=False and build a custom progress bar from the combined values of the progress attribute of every Pypdl instance, similar to how PypdlFactory does it. That is going to be pretty tedious, so a better approach is to adapt the code to PypdlFactory. I have made a few changes so that you can now pass functions to the url parameter; you can try this in the test version. Here is an example of a similar scenario; there the mirror_func parameter takes the dynamic URL, but it should be similar for the url parameter as well.

import json
from seleniumbase import Driver
from pypdl import PypdlFactory


def get_download_streams(url):
    driver = Driver(uc=True, log_cdp_events=True, devtools=True)
    driver.get(url)

    logs = driver.get_log("performance")

    for log in logs:
        log = json.loads(log["message"])
        if log["message"]["method"] == "Network.responseReceived":
            if log["message"]["params"]["response"]["mimeType"] == "video/mp4":
                stream_url = log["message"]["params"]["response"]["url"]
                return stream_url


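tasks = []
# Files will be saved to a folder called downloads (assuming it's already created).
# `link=link` binds the current link to each lambda; a plain closure would see
# only the last link by the time the factory calls it.
for link in links:
    tasks.append((lambda link=link: get_download_streams(link), {"file_path": "downloads/"}))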

# create a factory with 4 workers
factory = PypdlFactory(4)
x = factory.start(tasks)
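
One thing to watch when callables are built in a loop and only called later: Python closures look up the loop variable at call time, so a bare `lambda: f(link)` created in a `for` loop sees whatever `link` holds once the loop has finished. The sketch below illustrates this with a hypothetical `resolve` function standing in for a slow URL lookup; the default-argument form is one common workaround, and `functools.partial` is another.

```python
def resolve(link):
    # Hypothetical stand-in for a slow URL lookup.
    return f"resolved:{link}"


links = ["page-1", "page-2", "page-3"]

late = []
bound = []
for link in links:
    # Late binding: `link` is looked up when the lambda is called.
    late.append(lambda: resolve(link))
    # Early binding: the default argument stores the current value.
    bound.append(lambda link=link: resolve(link))

# Call everything only after the loop has finished, as a task runner would.
late_results = [f() for f in late]
bound_results = [f() for f in bound]

print(late_results)   # every entry refers to the last link
print(bound_results)  # one entry per link
```

If the callables are invoked after the loop, as they would be when a list of tasks is handed to a runner, only the bound form resolves each link separately.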

alepodj (Author) commented Aug 1, 2024

Thanks for the detailed response

mjishnu (Owner) commented Aug 3, 2024

did it fix the issue?

alepodj (Author) commented Aug 4, 2024

yeah all good, thanks again

mjishnu changed the title from "Any way to use Pypdl with ThreadPoolExecutor" to "Support for dynamic url" on Aug 4, 2024
mjishnu closed this as completed on Aug 4, 2024