Commit

feat: add tg.no crawling config (#2)
niccofyren authored Oct 13, 2024
1 parent 7aeba7f commit 29206a7
Showing 3 changed files with 36 additions and 1 deletion.
1 change: 0 additions & 1 deletion browsertrix-crawler/configs/tg24.yaml
@@ -1,4 +1,3 @@
# TODO: Adjust for new TG24 (and beyond) site url structure
seeds:
  # Crawl content available via navigation and frontpage
  - url: https://www.gathering.org
34 changes: 34 additions & 0 deletions browsertrix-crawler/configs/tgno.yaml
@@ -0,0 +1,34 @@
# Config intended to be used on new tg.no once launched. This page differs from
# previous iterations (in practice, even if not in theory) by being a single
# site gradually updated with new content and styling, rather than a new site
# each year.
seeds:
  # Crawl content available via navigation and frontpage
  - url: https://www.tg.no
    include:
      # Basic pages
      - www.tg.no

# Block calls to our tracking service
blockRules:
  - url: matomo.gathering.org

collection: tgno

behaviors: autoscroll,autoplay,autofetch,siteSpecific
waitUntil: load,networkidle0
generateCDX: true
combineWARCs: true
saveState: always
workers: 4
# TODO: Remove if not needed, hopefully we won't need consent flow on new site
# Minimal profile that includes consent answers
# profile: /crawls/profiles/tg24.tar.gz

# Make "live" crawling view available at 9037
newContext: window
screencastPort: 9037

warcinfo:
  operator: The Gathering
  hostname: www.tg.no
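
A note on the `include` entry in the new config: browsertrix-crawler treats `include` values as regular expressions, so the unescaped dots in `www.tg.no` match any character rather than only a literal dot. A minimal sketch of a stricter seed, assuming the intent is to match only the literal host name (the escaped pattern is an illustration and is not part of this commit):

```yaml
seeds:
  - url: https://www.tg.no
    include:
      # Assumption: escape the dots so they match a literal "." only
      - www\.tg\.no
```

The unescaped form is probably harmless in practice, since the crawler only follows links it finds on the site, so this would be a tightening rather than a required fix.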
2 changes: 2 additions & 0 deletions wayback/startup.sh
@@ -16,6 +16,7 @@ git clone https://github.com/gathering/go-archive-tg21 || (cd go-archive-tg21 ;
git clone https://github.com/gathering/go-archive-tg22 || (cd go-archive-tg22 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tg23 || (cd go-archive-tg23 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tg24 || (cd go-archive-tg24 ; git pull ; git lfs pull ; cd ..)
git clone https://github.com/gathering/go-archive-tgno || (cd go-archive-tgno ; git pull ; git lfs pull ; cd ..)

cd "$WORKDIR"

@@ -25,5 +26,6 @@ cp -r "$SOURCES/go-archive-tg21/browsertrix-crawler/crawls/collections/tg21/" "$
cp -r "$SOURCES/go-archive-tg22/browsertrix-crawler/tg22/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tg23/tg23/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tg24/tg24/" "$COLLECTIONS/"
cp -r "$SOURCES/go-archive-tgno/tgno/" "$COLLECTIONS/"

exec /docker-entrypoint.sh $@
