
Automatically crawl <form> URLs when method is get #702

Open
benoit74 opened this issue Oct 11, 2024 · 0 comments

I have set up a test page at https://website.test.openzim.org/form-get.html

This is a simplified version of something we have encountered in the wild on two occasions.

First, on https://chopin.lib.uchicago.edu/. If you open any title and its scores, you will see a combobox in the top-right corner, which is a form with a select element. On this website we also have prev/next links, so the combobox is not the only navigation option and all pages end up being crawled.

Second, on https://medecine-integree.com/ (we have only been approached by a user of this website; we do not have the right to copy it, at least not yet). On this website there is a combobox "Tous nos articles" ("all our articles") which is the only means of navigation to reach the pages behind it.

These comboboxes are simply used to "easily" generate a GET request with a given query parameter to load the proper page.

This is what has been reproduced on https://website.test.openzim.org/form-get.html
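For reference, the pattern in question is roughly a plain GET form driven by a select element (a hypothetical sketch; the markup on the actual test page may differ):

```html
<!-- Hypothetical sketch of the pattern: a GET form whose select drives
     navigation. Choosing an option submits the form, loading e.g.
     /form-get.html?page=article-1 -->
<form method="get" action="/form-get.html">
  <select name="page" onchange="this.form.submit()">
    <option value="article-1">Article 1</option>
    <option value="article-2">Article 2</option>
  </select>
</form>
```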

Currently, Browsertrix Crawler does not extract links from this kind of form/combobox, as can be seen by running the following command:

docker run -v $PWD/output:/output --name crawlme --rm  webrecorder/browsertrix-crawler:1.3.3 crawl --url "https://website.test.openzim.org/form-get.html" --cwd /output 

It would help to be able to crawl these URLs automatically.

Or is there a possibility I missed to customize Browsertrix Crawler with some JS code for link extraction? A bit like custom behaviors, but those are only aimed at loading more resources of a given page, not at extracting new URLs to crawl, if I'm not mistaken.
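For what it's worth, the extraction itself is simple: each GET form plus option value pair maps to one URL. A minimal sketch of that mapping, written as a pure helper so it can run outside the browser (in a page context the inputs would come from `document.querySelectorAll('form')`; all names and values below are hypothetical):

```javascript
// Sketch: expand a GET <form> with a <select> into the set of URLs it
// can produce. Each option value becomes one query-parameter variant.
function expandGetForm(baseUrl, action, selectName, optionValues) {
  return optionValues.map((value) => {
    // Resolve the form action relative to the page URL, as a browser would.
    const url = new URL(action || baseUrl, baseUrl);
    url.searchParams.set(selectName, value);
    return url.href;
  });
}

// Hypothetical values mirroring the test-page pattern:
const urls = expandGetForm(
  "https://website.test.openzim.org/form-get.html",
  "/form-get.html",
  "page",
  ["article-1", "article-2"]
);
console.log(urls);
```

The crawler would then queue each resulting URL like any other extracted link, subject to the usual scope rules.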

For now, I plan to build a "fake" sitemap on my own and pass it to Browsertrix Crawler to populate the proper links.
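That workaround can be sketched as a small script that turns the expanded form URLs into sitemap XML for the crawler to consume via its sitemap support (the URL list is hypothetical; real URLs containing `&` would additionally need XML escaping):

```javascript
// Sketch: build a minimal "fake" sitemap from a list of URLs so the
// crawler can seed them. URLs here contain no XML-special characters;
// a general version would escape &, <, and > in each <loc>.
function buildSitemap(urls) {
  const entries = urls
    .map((u) => `  <url><loc>${u}</loc></url>`)
    .join("\n");
  return `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${entries}\n</urlset>\n`;
}

const sitemap = buildSitemap([
  "https://website.test.openzim.org/form-get.html?page=article-1",
  "https://website.test.openzim.org/form-get.html?page=article-2",
]);
console.log(sitemap);
```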

Labels: None yet
Projects: Status: Triage
Development: No branches or pull requests
1 participant