How to make Prefect work with Headless Selenium? #3609

Closed
feliche93 opened this issue Nov 2, 2020 · 5 comments

feliche93 commented Nov 2, 2020

Description

I want to use Prefect for some automated scraping of my own social media stats, such as posts and profile views on LinkedIn. Because the pages rely on a lot of JavaScript and require logging in, headless Selenium is the easiest solution so far.

When I run my code inside the file with flow.run() everything works out perfectly.

Registering the flow also works, but when I execute it in a local environment, the logs show the following issue:

Unexpected error: TypeError("cannot pickle '_thread.lock' object")
Traceback (most recent call last):
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 881, in get_task_run_state
    result = self.result.write(value, **formatting_kwargs)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/results/local_result.py", line 116, in write
    value = self.serializer.serialize(new.value)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/prefect/engine/serializers.py", line 70, in serialize
    return cloudpickle.dumps(value)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/felixvemmer/Desktop/social_bots/env/lib/python3.8/site-packages/cloudpickle/cloudpickle_fast.py", line 563, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object

I believe the issue is that I am passing the driver object from one task to the next. However, the Selenium driver object cannot be pickled, as the logs indicate.
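
To illustrate the diagnosis above, here is a minimal sketch using only the standard library: an object that holds a `threading.Lock` fails to pickle with the same kind of error. `FakeDriver` and `try_pickle` are hypothetical names, not Selenium's or Prefect's API, and the assumption (suggested by the traceback, not confirmed) is that the real driver holds a lock somewhere in its object graph.

```python
import pickle
import threading


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver (assumption: it holds a lock)."""

    def __init__(self):
        self._lock = threading.Lock()


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the TypeError message."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


# Plain data pickles fine; the lock-holding object does not.
error = try_pickle(FakeDriver())
print(error)
```

Plain values such as dicts, lists, and strings do not have this problem, which is why returning scraped data instead of the driver avoids the error.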

Is there any way I can prevent serialising the driver object, and still use the driver (authenticated session) in other tasks? Or what would be a potential work around to make this work?
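
One possible workaround, sketched here under stated assumptions rather than taken from the thread: keep all browser work inside a single function and return only plain data, so the driver never becomes a task result. `driver_session` and `collect_stats` are hypothetical helpers; the only driver methods used are `get`, `quit`, and the `title` property, which exist on Selenium's WebDriver. The body of `collect_stats` could be wrapped in a single Prefect `@task`.

```python
from contextlib import contextmanager


@contextmanager
def driver_session(make_driver):
    """Create a driver, hand it to the caller, and always quit it afterwards."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


def collect_stats(make_driver, url):
    """Do all browser work in one place and return only picklable data."""
    with driver_session(make_driver) as driver:
        driver.get(url)
        # Login steps and page-specific selectors are omitted from this sketch.
        return {"url": url, "title": driver.title}
```

The trade-off is that the authenticated session is not shared across tasks; each task that needs a browser has to create and log in its own driver.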

Expected Behavior

When running the following flow/task in the UI, I would expect to see no issues, just as when I execute flow.run():

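import os

from prefect import Flow, Parameter, task
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager


@task
def create_driver(headless=False):

    # set Chrome options; the flags below are only applied in headless mode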
    chrome_options = Options()
    if headless:
        chrome_options.add_argument("--window-size=1920,1080")
        chrome_options.add_argument("--start-maximized")
        chrome_options.add_argument("--disable-gpu")
        chrome_options.add_argument('--disable-extensions')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument("--headless")
        chrome_options.add_argument(
            "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36")
    driver = webdriver.Chrome(
        ChromeDriverManager().install(),
        options=chrome_options
    )

    return driver

Here's my flow:

with Flow("Linkedin Automation") as flow:

    headless = Parameter("headless", default=True)

    username = os.getenv("LINKEDIN_USERNAME")
    password = os.getenv("LINKEDIN_PASSWORD")

    driver = create_driver(headless)
    driver = login_linkedin(driver, username, password)

Environment

{
  "config_overrides": {},
  "env_vars": [
    "PREFECT__FLOWS__CHECKPOINTING"
  ],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_backend": "server",
    "prefect_version": "0.13.13",
    "python_version": "3.8.2"
  }
}

Very much appreciate your help!

Thanks, Felix


0xjimm commented Nov 5, 2020

I ran into this issue the other day, Dylan helped me out on the Slack channel.

I defined a local storage and set stored_as_script to True.

from prefect import Flow
from prefect.environments.storage import Local

with Flow("Linkedin Automation") as flow:
    ...

flow.storage = Local(path='path/to/your/flow.py', stored_as_script=True)

flow.run()

@feliche93
Author

@lejimmy thank you so much for helping me out on that :) Works like a charm!

The only thing I noticed is that when I split the tasks up into modules and import the functions, I get the same error. So for now I guess I have to keep all tasks that return driver objects in one file? Or did you by any chance also face this issue? :)


0xjimm commented Nov 7, 2020

That’s what I’ve been doing.

Maybe saving your cookies can help you bypass some steps once you’ve already authenticated: https://stackoverflow.com/a/48665557
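
A rough sketch of the cookie idea, for illustration only: `save_cookies` and `load_cookies` are hypothetical helpers, and JSON is used here as the file format as an assumption (the linked answer may use a different one). The only Selenium calls relied on are `driver.get_cookies()` and `driver.add_cookie()`.

```python
import json
from pathlib import Path


def save_cookies(driver, path):
    """Write the driver's current cookies to a JSON file; return how many."""
    cookies = driver.get_cookies()
    Path(path).write_text(json.dumps(cookies))
    return len(cookies)


def load_cookies(driver, path):
    """Add previously saved cookies to the driver; return how many."""
    cookies = json.loads(Path(path).read_text())
    # Selenium only accepts cookies for the domain the browser is currently
    # on, so call driver.get(<site url>) before calling this function.
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)
```

Because only a file path moves between tasks, nothing unpicklable has to be returned; each task creates its own driver, opens the site, loads the cookies, and refreshes the page.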

@feliche93
Author

Answer here and in Slack thread: https://prefect-community.slack.com/archives/CL09KU1K7/p1603318809428700

Member

cicdw commented Nov 15, 2020

Archived the thread here for better discoverability: #3669
