feat: Merge workflow job types (#2068)
Resolves #2073

### Changes

- Removes "URL List" and "Seeded Crawl" job type distinction and adds as
additional crawl scope types instead.
- 'New Workflow' button defaults to Single Page
- 'New Workflow' dropdown includes Page Crawl (Single Page, List of Pages, In-Page Links) and Site Crawl (Pages in Same Directory, Pages on Same Domain, Pages on Same Domain + Subdomains, Custom Page Prefix)
- Enables specifying `DOCS_URL` in `.env`
- Additional follow-ups in #2090, #2091
SuaYoo authored Sep 25, 2024
1 parent 62da0fb commit 612bbb6
Showing 28 changed files with 911 additions and 908 deletions.
12 changes: 6 additions & 6 deletions docs/user-guide/crawl-workflows.md
@@ -12,17 +12,17 @@ Create new crawl workflows from the **Crawling** page, or the _Create New ..._

### Choose what to crawl

The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **Page List** or **Site Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
The first step in creating a new crawl workflow is to choose what you'd like to crawl by defining a **Crawl Scope**. Crawl scopes are categorized as a **Page Crawl** or **Site Crawl**.

#### Page List
#### Page Crawl

Choose this option if you already know the URL of every page you'd like to crawl. The crawler will visit every URL specified in a list, and optionally every URL linked on those pages.
Choose one of these crawl scopes if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.

A Page List workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
A Page Crawl workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.

#### Site Crawl

Let the crawler automatically discover pages based on a domain or start page that you specify.
Choose one of these crawl scopes to have the crawler automatically find pages based on a domain name, start page URL, or directory on a website.

Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.

@@ -34,7 +34,7 @@ Run a crawl workflow by clicking _Run Crawl_ in the actions menu of the workflow

While crawling, the **Watch Crawl** section displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue. You can [modify the crawl live](./running-crawl.md) by adding URL exclusions or changing the number of crawling instances.

Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated [crawl scope](workflow-setup.md#crawl-scope).
Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated [crawl scope](workflow-setup.md#crawl-scope-options).

## Status

2 changes: 1 addition & 1 deletion docs/user-guide/getting-started.md
@@ -28,7 +28,7 @@ Once you've logged in you should see your org [overview](overview.md). If you la
After running your first crawl, check out the following to learn more about Browsertrix's features:

- A detailed list of [crawl workflow setup](workflow-setup.md) options.
- Adding [exclusions](workflow-setup.md#exclusions) to limit your crawl's scope and evading crawler traps by [editing exclusion rules while crawling](running-crawl.md#live-exclusion-editing).
- Adding [exclusions](workflow-setup.md#exclude-pages) to limit your crawl's scope and evading crawler traps by [editing exclusion rules while crawling](running-crawl.md#live-exclusion-editing).
- Best practices for crawling with [browser profiles](browser-profiles.md) to capture content only available when logged in to a website.
- Managing archived items, including [uploading previously archived content](archived-items.md#uploading-web-archives).
- Organizing and combining archived items with [collections](collections.md) for sharing and export.
2 changes: 1 addition & 1 deletion docs/user-guide/running-crawl.md
@@ -17,7 +17,7 @@ A crawl workflow that is in progress can be in one of the following states:

## Live Exclusion Editing

While [exclusions](workflow-setup.md#exclusions) can be set before running a crawl workflow, sometimes while crawling the crawler may find new parts of the site that weren't previously known about and shouldn't be crawled, or get stuck browsing parts of a website that automatically generate URLs known as ["crawler traps"](https://en.wikipedia.org/wiki/Spider_trap).
While [exclusions](workflow-setup.md#exclude-pages) can be set before running a crawl workflow, sometimes while crawling the crawler may find new parts of the site that weren't previously known about and shouldn't be crawled, or get stuck browsing parts of a website that automatically generate URLs known as ["crawler traps"](https://en.wikipedia.org/wiki/Spider_trap).

If the crawl queue is filled with URLs that should not be crawled, use the _Edit Exclusions_ button on the Watch Crawl page to instruct the crawler what pages should be excluded from the queue.

105 changes: 68 additions & 37 deletions docs/user-guide/workflow-setup.md
@@ -6,83 +6,114 @@ Changes to a setting will only apply to subsequent crawls.

Crawl settings are shown in the crawl workflow detail **Settings** tab and in the archived item **Crawl Settings** tab.

## Crawl Scope
## Scope

Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _URL List_ or _Site Crawl_ when creating a new workflow.
Specify the range and depth of your crawl.

??? example "Crawling with HTTP basic auth"

Both Page List and Site Crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.

**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.
Crawl scopes are categorized as a **Page Crawl** or **Site Crawl**:

### Crawl Type: Page List
_Page Crawl_
: Choose one of these crawl scopes if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.

#### Page URL(s)
A Page Crawl workflow can be simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.

A list of one or more URLs that the crawler should visit and capture.
??? info "Page Crawl Use Cases"
- You want to archive a social media post (`Single Page`)
- You have a list of URLs that you can copy-and-paste (`List of Pages`)
- You want to include URLs with different domain names in the same crawl (`List of Pages`)

#### Include Any Linked Page
_Site Crawl_
: Choose one of these crawl scopes to have the crawler automatically find pages based on a domain name, start page URL, or directory on a website.

When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field.
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.

??? example "Crawling tags & search queries with Page List crawls"
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page.
??? info "Site Crawl Use Cases"
- You're archiving a subset of a website, like everything under _website.com/your-username_ (`Pages in Same Directory`)
- You're archiving an entire website _and_ external pages linked to from the website (`Pages on Same Domain` + _Include Any Linked Page_ checked)

#### Fail Crawl on Failed URL
### Crawl Scope Options

When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed".
#### Page Crawl

### Crawl Type: Site Crawl
`Single Page`
: Crawls a single URL and does not include any linked pages.

#### Crawl Start URL
`List of Pages`
: Crawls only specified URLs and does not include any linked pages.

This is the first page that the crawler will visit. It's important to set _Crawl Start URL_ that accurately represents the scope of the pages you wish to crawl as the _Start URL Scope_ selection will depend on this field's contents.
`In-Page Links`
: Crawls only the specified URL and treats linked sections of the page as distinct pages.

You must specify the protocol (likely `http://` or `https://`) as a part of the URL entered into this field.
Any link that begins with the _Crawl Start URL_ followed by a hashtag symbol (`#`) and then a string is considered an in-page link. This is commonly used to link to a section of a page. For example, because the "Scope" section of this guide is linked by its heading as `/user-guide/workflow-setup/#scope` it would be treated as a separate page under the _In-Page Links_ scope.

#### Start URL Scope
This scope can also be useful for crawling websites that are single-page applications where each page has its own hash, such as `example.com/#/blog` and `example.com/#/about`.
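As an illustration (the start URL and links below are hypothetical), a workflow using the _In-Page Links_ scope would treat links as follows:

```
Crawl Start URL: https://example.com/guide

https://example.com/guide#intro    -> crawled as a distinct page (in-page link)
https://example.com/guide#setup    -> crawled as a distinct page (in-page link)
https://example.com/other-page     -> not crawled (outside the In-Page Links scope)
```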

`Hashtag Links Only`
: This scope will ignore links that lead to other addresses such as `example.com/path` and will instead instruct the crawler to visit hashtag links such as `example.com/#linkedsection`.
#### Site Crawl

This scope can be useful for crawling certain web apps that may not use unique URLs for their pages.

`Pages in the Same Directory`
`Pages in Same Directory`
: This scope will only crawl pages in the same directory as the _Crawl Start URL_. If `example.com/path` is set as the _Crawl Start URL_, `example.com/path/path2` will be crawled but `example.com/path3` will not.

`Pages on This Domain`
`Pages on Same Domain`
: This scope will crawl all pages on the domain entered as the _Crawl Start URL_; however, it will ignore subdomains such as `subdomain.example.com`.

`Pages on This Domain and Subdomains`
`Pages on Same Domain + Subdomains`
: This scope will crawl all pages on the domain and any subdomains found. If `example.com` is set as the _Crawl Start URL_, both pages on `example.com` and `subdomain.example.com` will be crawled.

`Custom Page Prefix`
: This scope will crawl all pages that begin with the _Crawl Start URL_ as well as any pages whose URLs begin with the URLs listed in `Extra URL Prefixes in Scope`.
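As a hypothetical illustration of how three of these scopes differ, assume a _Crawl Start URL_ of `https://example.com/blog/`:

```
URL                              Same Directory   Same Domain    + Subdomains
https://example.com/blog/post-1  in scope         in scope       in scope
https://example.com/about        out of scope     in scope       in scope
https://docs.example.com/        out of scope     out of scope   in scope
```

Under `Custom Page Prefix`, scope is instead determined by whether a URL begins with the start URL or one of the extra prefixes you list.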

#### Max Depth
### Page URL(s)

Only shown with a _Start URL Scope_ of `Pages on This Domain` and above, the _Max Depth_ setting instructs the crawler to stop visiting new links past a specified depth.
One or more URLs of the pages to crawl. URLs must follow [valid URL syntax](https://www.w3.org/Addressing/URL/url-spec.html). For example, if you're crawling a page that can be accessed on the public internet, your URL should start with `http://` or `https://`.

#### Extra URL Prefixes in Scope
??? example "Crawling with HTTP basic auth"

Only shown with a _Start URL Scope_ of `Custom Page Prefix`, this field accepts additional URLs or domains that will be crawled if URLs that lead to them are found.
All crawl scopes support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.

**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.

This can be useful for crawling websites that span multiple domains such as `example.org` and `example.net`
### Crawl Start URL

#### Include Any Linked Page ("one hop out")
This is the first page that the crawler will visit. _Site Crawl_ scopes are based on this URL.

When enabled, the crawler will visit all the links it finds within each page, regardless of the _Start URL Scope_ setting.
### Include Any Linked Page

When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field.

??? example "Crawling tags & search queries with Page List crawls"
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page.

### Fail Crawl on Failed URL

When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed".

### Max Depth in Scope

Instructs the crawler to stop visiting new links past a specified depth.

### Extra URL Prefixes in Scope

This field accepts additional URLs or domains that will be crawled if URLs that lead to them are found.

This can be useful for crawling websites that span multiple domains such as `example.org` and `example.net`.
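For example (the URLs below are hypothetical), a `Custom Page Prefix` workflow starting on `https://example.org/` could list extra prefixes so that linked pages on a second domain and within a specific directory are also crawled:

```
https://example.net/
https://example.com/docs/
```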

### Include Any Linked Page ("one hop out")

When enabled, the crawler bypasses the _Crawl Scope_ setting to visit links it finds in each page within scope. The crawler will not visit links it finds in the pages found outside of scope (hence only "one hop out").

This can be useful for capturing links on a page that lead outside the website that is being crawled but should still be included in the archive for context.

#### Check For Sitemap
### Check For Sitemap

When enabled, the crawler will check for a sitemap at `/sitemap.xml` and use it to discover pages to crawl if found. It will not crawl pages found in the sitemap that do not meet the crawl's scope settings or limits.

This can be useful for discovering and capturing pages on a website that aren't linked to from the seed and which might not otherwise be captured.
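For reference, a sitemap is an XML file in the standard [sitemaps.org](https://www.sitemaps.org/) format. A minimal, hypothetical example of what the crawler might find at `/sitemap.xml`:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <loc> is a page the crawler can queue, subject to the workflow's scope settings and limits -->
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/unlinked-page</loc></url>
</urlset>
```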

### Exclusions
### Additional Pages

A list of page URLs outside of the _Crawl Scope_ to include in the crawl.

### Exclude Pages

The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when _Include Any Linked Page_ is enabled.
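As a hypothetical example, exclusions matching part of a link could be used to skip sign-in pages and tracking-parameter URLs:

```
/login
?utm_
```

Any link containing either string, such as `https://example.com/login?next=/` or `https://example.com/post?utm_source=feed`, would be ignored by the crawler.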

3 changes: 2 additions & 1 deletion frontend/sample.env.local
@@ -1,2 +1,3 @@
API_BASE_URL=
GLITCHTIP_DSN=
DOCS_URL=https://docs.browsertrix.com/
GLITCHTIP_DSN=
