
Automatically crawl <form> URLs when method is get #702

Open
benoit74 opened this issue Oct 11, 2024 · 0 comments

I have set up a test page at https://website.test.openzim.org/form-get.html

This is a simplified version of something we have encountered in the wild on two occasions.

First, on https://chopin.lib.uchicago.edu/. If you open any title and its scores, you will see a combobox in the top-right corner, which is a form with a select element. On this website we also have prev/next links, so the combobox is not the only navigation option and all pages end up being crawled.

Second, on https://medecine-integree.com/ (we have only been approached by a user of this website; we do not have the right to copy it, at least not yet). On this website there is a combobox "Tous nos articles" ("all our articles") which is the only means of navigation to reach the pages behind it.

These comboboxes are simply used to "easily" generate a GET request with a given query parameter to load the proper page.

This is what has been reproduced on https://website.test.openzim.org/form-get.html
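For reference, the pattern in question is roughly a plain GET form driven by a select element (a hypothetical sketch; the markup on the actual test page may differ):

```html
<!-- Hypothetical sketch of the pattern: a GET form whose select drives
     navigation. Choosing an option submits the form, loading e.g.
     /form-get.html?page=article-1 -->
<form method="get" action="/form-get.html">
  <select name="page" onchange="this.form.submit()">
    <option value="article-1">Article 1</option>
    <option value="article-2">Article 2</option>
  </select>
</form>
```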

Currently, Browsertrix Crawler does not extract links from this kind of form/combobox, as can be seen by running the following command:

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.3 crawl --url "https://website.test.openzim.org/form-get.html" --cwd /output 

It would help to be able to crawl these URLs automatically.

Or is there a possibility I missed to customize Browsertrix Crawler with some JS code for link extraction? A bit like custom behaviors, but those are only aimed at loading more resources of a given page, not at extracting new URLs to crawl, if I'm not mistaken.
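For what it's worth, the extraction itself is simple: each GET form plus option value pair maps to one URL. A minimal sketch of that mapping, written as a pure helper so it can run outside the browser (in a page context the inputs would come from `document.querySelectorAll('form')`; all names and values below are hypothetical):

```javascript
// Sketch: expand a GET <form> with a <select> into the set of URLs it
// can produce. Each option value becomes one query-parameter variant.
function expandGetForm(baseUrl, action, selectName, optionValues) {
  return optionValues.map((value) => {
    // Resolve the form action relative to the page URL, as a browser would.
    const url = new URL(action || baseUrl, baseUrl);
    url.searchParams.set(selectName, value);
    return url.href;
  });
}

// Hypothetical values mirroring the test-page pattern:
const urls = expandGetForm(
  "https://website.test.openzim.org/form-get.html",
  "/form-get.html",
  "page",
  ["article-1", "article-2"]
);
console.log(urls);
```

The crawler would then queue each resulting URL like any other extracted link, subject to the usual scope rules.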

For now, I plan to build a "fake" sitemap on my own and pass it to Browsertrix Crawler to populate the proper links.
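That workaround can be sketched as a small script that turns the expanded form URLs into sitemap XML for the crawler to consume via its sitemap support (the URL list is hypothetical; real URLs containing `&` would additionally need XML escaping):

```javascript
// Sketch: build a minimal "fake" sitemap from a list of URLs so the
// crawler can seed them. URLs here contain no XML-special characters;
// a general version would escape &, <, and > in each <loc>.
function buildSitemap(urls) {
  const entries = urls
    .map((u) => `  <url><loc>${u}</loc></url>`)
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${entries}\n</urlset>\n`;
}

const sitemap = buildSitemap([
  "https://website.test.openzim.org/form-get.html?page=article-1",
  "https://website.test.openzim.org/form-get.html?page=article-2",
]);
console.log(sitemap);
```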

Labels: None yet
Projects: Status: Triage
Development: No branches or pull requests
1 participant