Merge pull request #72 from oscar-project/dev-kenlm
KenLM based content detection
Uinelj authored Dec 13, 2022
2 parents 70616f1 + 91a0d00 commit 5e69ed6
Showing 18 changed files with 672 additions and 84 deletions.
39 changes: 24 additions & 15 deletions .github/workflows/rust.yml
@@ -7,7 +7,7 @@ on:
branches: [master, dev]
env:
CARGO_TERM_COLOR: always

jobs:
cache_test_data:
runs-on: ubuntu-latest
@@ -34,32 +34,41 @@ jobs:
       - name: Fetch identification bins
         if: steps.cache-lid.outputs.cache-hit != 'true'
         run: |
-          wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
-          wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.189.bin
+          wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
+          wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.189.bin
       - name: Fetch blocklist
         if: steps.cache-blocklist.outputs.cache-hit != 'true'
         run: |
-          mkdir -p res/blocklist
-          wget ftp://ftp.ut-capitole.fr/pub/reseau/cache/squidguard_contrib/blacklists.tar.gz
-          tar xvzf blacklists.tar.gz
-          mv blacklists/* res/blocklist
-          rm blacklists.tar.gz
-          rmdir blacklists
+          mkdir -p res/blocklist
+          wget https://github.com/olbat/ut1-blacklists/archive/refs/heads/master.zip
+          unzip master.zip
+          mv ut1-blacklists-master/blacklists/* res/blocklist
+          gzip -d res/blocklist/adult/domains.gz #adult blocklist is compressed
+          rm -r ut1-blacklists-master
       - name: Fetch CC shards
         if: steps.cache-shards.outputs.cache-hit != 'true'
         run: |
-          mkdir -p res/shards
-          wget -O res/shards/0.txt.gz https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-33/segments/1659882570651.49/wet/CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
-          wget -O res/shards/1.txt.gz https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-33/segments/1659882570651.49/wet/CC-MAIN-20220807150925-20220807180925-00001.warc.wet.gz
+          mkdir -p res/shards
+          wget -O res/shards/0.txt.gz https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-33/segments/1659882570651.49/wet/CC-MAIN-20220807150925-20220807180925-00000.warc.wet.gz
+          wget -O res/shards/1.txt.gz https://data.commoncrawl.org/crawl-data/CC-MAIN-2022-33/segments/1659882570651.49/wet/CC-MAIN-20220807150925-20220807180925-00001.warc.wet.gz
+      - name: Install KenLM dependencies
+        run: sudo apt install -y libboost-all-dev libeigen3-dev
+      - name: Get sample KenLM model
+        run: |
+          mkdir -p res/kenlm
+          wget -O res/kenlm/en.arpa https://raw.githubusercontent.com/agatan/ctclib/main/data/overfit.arpa
       - name: Build
         run: cargo build --verbose
       - name: Create test directories
-        run: mkdir res/corpus res/rebuilt
+        run: |
+          mkdir res/corpus
+          mkdir res/rebuilt
+          mkdir -p res/corpus/rebuild
+          ls res/
       - name: Run tests
-        run: cargo test --verbose
+        run: RUST_BACKTRACE=1 cargo test --verbose
       - name: Run cargo-tarpaulin
         uses: actions-rs/[email protected]
         continue-on-error: true
       - name: Upload to codecov.io
         uses: codecov/codecov-action@v1

1 change: 1 addition & 0 deletions .gitignore
@@ -7,3 +7,4 @@ Cargo.lock
 /dst*
 /result*
 lid*
+res/
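
To run the tests outside CI, the resource layout built by the workflow steps above has to exist locally. The following is a minimal sketch of that setup, not part of the commit: `prepare_dirs` and `fetch_if_missing` are hypothetical helper names, and the "skip when the file already exists" check is an assumed local stand-in for the workflow's `cache-hit != 'true'` guards.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and do not exist in the repository.

# Create every directory the workflow steps write into, under a root
# directory (default: res). mkdir -p makes the call safe to repeat.
prepare_dirs() {
  root="${1:-res}"
  mkdir -p "$root/blocklist" "$root/shards" "$root/kenlm" \
           "$root/corpus/rebuild" "$root/rebuilt"
}

# Download url to dest only when dest is missing. This approximates the
# cache guards in the workflow with a plain file-existence check.
fetch_if_missing() {
  url="$1"
  dest="$2"
  if [ -e "$dest" ]; then
    echo "skip: $dest already present"
    return 0
  fi
  mkdir -p "$(dirname "$dest")"
  wget -O "$dest" "$url"
}
```

A typical use would be `prepare_dirs res` followed by `fetch_if_missing https://raw.githubusercontent.com/agatan/ctclib/main/data/overfit.arpa res/kenlm/en.arpa`, which matches the sample KenLM model path used in the workflow. Since `res/` is now ignored by Git, the downloaded files stay out of the working tree status.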