feat: Display scale as number of browser windows #2057

Merged · 18 commits · Sep 6, 2024
2 changes: 2 additions & 0 deletions backend/btrixcloud/main.py
@@ -111,6 +111,7 @@ class SettingsResponse(BaseModel):
defaultPageLoadTimeSeconds: int

maxPagesPerCrawl: int
numBrowsers: int
maxScale: int

billingEnabled: bool
@@ -143,6 +144,7 @@ def main() -> None:
os.environ.get("DEFAULT_PAGE_LOAD_TIME_SECONDS", 120)
),
maxPagesPerCrawl=int(os.environ.get("MAX_PAGES_PER_CRAWL", 0)),
numBrowsers=int(os.environ.get("NUM_BROWSERS", 1)),
maxScale=int(os.environ.get("MAX_CRAWL_SCALE", 3)),
billingEnabled=is_bool(os.environ.get("BILLING_ENABLED")),
signUpUrl=os.environ.get("SIGN_UP_URL", ""),
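For reference, a minimal sketch of how a frontend client might model the extended `/api/settings` payload after this change. The field names come from the `SettingsResponse` diff above; the interface name `AppSettings` is an assumption for illustration, not the app's actual type:

```ts
// Hypothetical TypeScript model of the settings response after this PR.
// Field names mirror the Pydantic SettingsResponse above.
interface AppSettings {
  defaultPageLoadTimeSeconds: number;
  maxPagesPerCrawl: number;
  numBrowsers: number; // browsers per crawler pod, from NUM_BROWSERS
  maxScale: number; // max crawler pods per crawl, from MAX_CRAWL_SCALE
  billingEnabled: boolean;
}
```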
1 change: 1 addition & 0 deletions backend/test/test_api.py
@@ -43,6 +43,7 @@ def test_api_settings():
"jwtTokenLifetime": 86400,
"defaultBehaviorTimeSeconds": 300,
"maxPagesPerCrawl": 4,
"numBrowsers": 2,
"maxScale": 3,
"defaultPageLoadTimeSeconds": 120,
"billingEnabled": True,
2 changes: 2 additions & 0 deletions chart/templates/configmap.yaml
@@ -56,6 +56,8 @@ data:

MIN_QA_CRAWLER_IMAGE: "{{ .Values.min_qa_crawler_image }}"

NUM_BROWSERS: "{{ .Values.crawler_browser_instances }}"

MAX_CRAWLER_MEMORY: "{{ .Values.max_crawler_memory }}"

ENABLE_AUTO_RESIZE_CRAWLERS: "{{ .Values.enable_auto_resize_crawlers }}"
2 changes: 1 addition & 1 deletion docs/user-guide/crawl-workflows.md
@@ -34,7 +34,7 @@ Run a crawl workflow by clicking _Run Crawl_ in the actions menu of the workflow

While crawling, the **Watch Crawl** section displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue. You can [modify the crawl live](./running-crawl.md) by adding URL exclusions or changing the number of crawling instances.

Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated [crawl scope](workflow-setup.md#scope).
Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated [crawl scope](workflow-setup.md#crawl-scope).

## Status

4 changes: 2 additions & 2 deletions docs/user-guide/running-crawl.md
@@ -23,9 +23,9 @@ If the crawl queue is filled with URLs that should not be crawled, use the _Edit

Exclusions added while crawling are applied to the same exclusion table saved in the workflow's settings and will be used the next time the crawl workflow is run unless they are manually removed.

## Changing the Number of Crawler Instances
## Changing the Number of Browser Windows

Like exclusions, the [crawler instance](workflow-setup.md#crawler-instances) scale can also be adjusted while crawling. On the Watch Crawl page, press the _Edit Crawler Instances_ button, and set the desired value.
Like exclusions, the number of [browser windows](workflow-setup.md#browser-windows) can also be adjusted while crawling. On the **Watch Crawl** tab, press the _Edit Browser Windows_ button, and set the desired value.

Unlike exclusions, this change will not be applied to future workflow runs.

15 changes: 10 additions & 5 deletions docs/user-guide/workflow-setup.md
@@ -6,7 +6,7 @@ Changes to a setting will only apply to subsequent crawls.

Crawl settings are shown in the crawl workflow detail **Settings** tab and in the archived item **Crawl Settings** tab.

## Scope
## Crawl Scope

Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Known URLs_ (crawl type of **URL List**) or _Automated Discovery_ (crawl type of **Seeded Crawl**) when creating a new workflow.

@@ -114,10 +114,6 @@ The crawl will be gracefully stopped after this set period of elapsed time.

The crawl will be gracefully stopped after reaching this set size in GB.

### Crawler Instances

Increasing the amount of crawler instances will speed up crawls by using additional browser windows to capture more pages in parallel. This will also increase the amount of traffic sent to the website and may result in a higher chance of getting rate limited.

### Page Load Timeout

Limits the amount of elapsed time to wait for a page to load. Behaviors will run after this timeout only if the page is partially or fully loaded.
@@ -146,6 +142,15 @@ Configure the browser used to visit URLs during the crawl.

Sets the [_Browser Profile_](browser-profiles.md) to be used for this crawl.

### Browser Windows

Sets the number of browser windows that are open and visiting pages during a crawl. Increasing the number of browser windows speeds up crawls by capturing more pages in parallel.

There are some trade-offs:

- More browser windows send more traffic to the website, which may increase the chance of getting rate limited.
- More execution minutes will be used per crawl.

### Crawler Release Channel

Sets the release channel of [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) to be used for this crawl. Crawls started by this workflow will use the latest crawler version from the selected release channel. Generally, "Default" will be the most stable; however, others may have newer features (or bugs)!
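The **Browser Windows** section above, and all of the UI changes in this PR, rest on one piece of arithmetic: the window count shown to users is the stored crawler `scale` multiplied by the per-pod browser count from settings. A minimal sketch of that mapping — the helper name is hypothetical, and the example values are borrowed from the backend test above (`numBrowsers: 2`, `maxScale: 3`):

```ts
// Browser windows shown in the UI = crawler pods (scale) × browsers
// per pod (numBrowsers). `windowsForScale` is an illustrative name,
// not a function in this codebase.
function windowsForScale(scale: number, numBrowsers: number): number {
  return scale * numBrowsers;
}

// With numBrowsers = 2, the three scale settings map to 2, 4, and 6
// total browser windows:
const options = [1, 2, 3].map((scale) => windowsForScale(scale, 2));
console.log(options); // [2, 4, 6]
```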
10 changes: 6 additions & 4 deletions frontend/src/components/ui/config-details.ts
@@ -166,10 +166,6 @@ export class ConfigDetails extends LiteElement {
msg("Crawl Size Limit"),
renderSize(crawlConfig?.maxCrawlSize),
)}
${this.renderSetting(
msg("Crawler Instances"),
crawlConfig?.scale ? `${crawlConfig.scale}×` : "",
)}
<btrix-section-heading style="--margin: var(--sl-spacing-medium)">
<h4>${sectionStrings.perPageLimits}</h4>
</btrix-section-heading>
@@ -232,6 +228,12 @@
>`,
),
)}
${this.renderSetting(
msg("Browser Windows"),
crawlConfig?.scale && this.appState.settings
? `${crawlConfig.scale * this.appState.settings.numBrowsers}`
: "",
)}
${this.renderSetting(
msg("Crawler Channel (Exact Crawler Version)"),
capitalize(crawlConfig?.crawlerChannel || "default") +
55 changes: 32 additions & 23 deletions frontend/src/features/crawl-workflows/workflow-editor.ts
@@ -1254,29 +1254,6 @@ https://archiveweb.page/images/${"logo.svg"}`}
</sl-input>
`)}
${this.renderHelpTextCol(infoTextStrings["maxCrawlSizeGB"])}
${inputCol(html`
<sl-radio-group
name="scale"
label=${msg("Crawler Instances")}
value=${this.formState.scale}
@sl-change=${(e: Event) =>
this.updateFormState({
scale: +(e.target as SlCheckbox).value,
})}
>
${map(
range(this.defaults.maxScale),
(i: number) =>
html` <sl-radio-button value="${i + 1}" size="small"
>${i + 1}×</sl-radio-button
>`,
)}
</sl-radio-group>
`)}
${this.renderHelpTextCol(
msg(`Increasing parallel crawler instances can speed up crawls, but may
increase the chances of getting rate limited.`),
)}
${this.renderSectionHeading(sectionStrings.perPageLimits)}
${inputCol(html`
<sl-input
@@ -1366,6 +1343,38 @@
></btrix-select-browser-profile>
`)}
${this.renderHelpTextCol(infoTextStrings["browserProfile"])}
${inputCol(html`
<sl-radio-group
name="scale"
label=${msg("Browser Windows")}
value=${this.formState.scale}
@sl-change=${(e: Event) =>
this.updateFormState({
scale: +(e.target as SlCheckbox).value,
})}
>
${when(this.appState.settings?.numBrowsers, (numBrowsers) =>
map(
range(this.defaults.maxScale),
(i: number) =>
html` <sl-radio-button value="${i + 1}" size="small"
>${(i + 1) * numBrowsers}</sl-radio-button
>`,
),
)}
</sl-radio-group>
`)}
${this.renderHelpTextCol(
html`${msg(
`Increase the number of open browser windows during a crawl. This will speed up your crawl by effectively running more crawlers at the same time.`,
)}
<a
href="https://docs.browsertrix.com/user-guide/workflow-setup/#browser-windows"
class="text-blue-600 hover:text-blue-500"
target="_blank"
>${msg("See caveats")}</a
>.`,
)}

💬 Inline review comment (Member): Don't have to do it for this PR, but it would be really cool if docs links we link to in the app opened up in the sidebar!
${inputCol(html`
<btrix-select-crawler
.crawlerChannel=${this.formState.crawlerChannel}
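To summarize the swap in the editor diff above: the removed control labeled each option by crawler multiplier (`1×`–`3×`), while the new one submits the same `scale` values but labels them with the total browser window count. A condensed sketch of the option math, using assumed values (`maxScale = 3`, `numBrowsers = 2`, matching the defaults and backend test above):

```ts
// Options the radio group renders: the submitted value stays `scale`
// (1..maxScale), while the visible label is scale × numBrowsers.
// The constants below are assumptions for illustration.
const maxScale = 3;
const numBrowsers = 2;

const options = Array.from({ length: maxScale }, (_, i) => ({
  value: i + 1, // stored in formState.scale
  label: `${(i + 1) * numBrowsers}`, // shown to the user
}));

console.log(options);
// [ { value: 1, label: "2" }, { value: 2, label: "4" }, { value: 3, label: "6" } ]
```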
2 changes: 1 addition & 1 deletion frontend/src/features/org/usage-history-table.ts
@@ -51,7 +51,7 @@ export class UsageHistoryTable extends BtrixElement {
<sl-tooltip>
<div slot="content" style="text-transform: initial">
${msg(
"Aggregated time across all crawler instances that the crawler was actively executing a crawl or QA analysis run, i.e. not in a waiting state",
"Aggregated time across all browser windows that the crawler was actively executing a crawl or QA analysis run, i.e. not in a waiting state",
)}
</div>
<sl-icon name="info-circle" style="vertical-align: -.175em"></sl-icon>