Add --jsonl option #35

Merged (20 commits) on Jan 22, 2024
Conversation


@jelmervdl jelmervdl commented Feb 17, 2023

This is mostly to do some metadata analysis of the warcs, but could be a starting point for #34 as well.

For metadata I'm considering trying out writing to parquet directly. But since warc2text is run in parallel we'd still need to merge parquet files together before doing any analysis. So maybe jsonl is sufficient for this stage. And then we ingest all of those together into a massive parquet file for queries later.
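For reference, that later "ingest everything into one parquet file" step could look something like this (just a sketch, not part of this PR; assumes DuckDB is available and that all the per-shard JSONL files sit in the current directory):

duckdb -c "COPY (SELECT * FROM read_json_auto('*.jsonl.gz')) TO 'metadata.parquet' (FORMAT PARQUET)"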

Current output: each line contains a JSON object with the following keys (see the example after the list):

  • f: filename of warc file
  • o: byte offset of record in warc file
  • s: warc file record size
  • rs: byte size of record payload (uncompressed)
  • ps: byte size of the text-only payload (compare this against rs to get the amount of HTML removed)
  • l: language identified by the classifier
  • u: url
  • c: content type as reported by the HTTP response header (or warc record header if that isn't present)
  • p: plain text
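For illustration, a single line might look like this (values are made up, and p is truncated here):

{"f": "CC-MAIN-20221126080725-20221126110725-00000.warc.gz", "o": 163840, "s": 5120, "rs": 40960, "ps": 2048, "l": "en", "u": "http://example.com/", "c": "text/html", "p": "Example Domain\nThis domain is for use in illustrative examples…"}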

Todo:

  • ts: crawl date as found in the record header (no date normalisation or anything)
  • pt: for each paragraph/line in p, the most deeply nested tag it was found in.
    Should this be an array of strings, or a single string separated by newlines to match p? (See the illustration after this list.)
  • pi: paragraph identifiers as normally produced by get_paragraph_id().
    Same question as for pt; or we could just keep that function as-is and add the paragraph identifiers inside p, which is a real mess but might be easiest for compatibility.

    Moving these things to Track html tags #46.
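To illustrate the two representations being weighed for pt (purely hypothetical values):

{"p": "Heading\nFirst paragraph", "pt": ["h1", "p"]}
{"p": "Heading\nFirst paragraph", "pt": "h1\np"}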

I also want to make these new columns available to the original bitext output as possible arguments for -f.

--multilang is also supported for the CLD2 classifier. In that case you'd get multiple JSON lines per record, one for each identified language. The attributes that relate to the record itself are duplicated; only p, ps and l differ.
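For example, a record containing both English and French text would produce two lines like these (made-up values; other fields omitted for brevity):

{"f": "WIDE-20171021194807-00260.warc.gz", "o": 98304, "u": "http://example.com/", "l": "en", "ps": 1500, "p": "Hello …"}
{"f": "WIDE-20171021194807-00260.warc.gz", "o": 98304, "u": "http://example.com/", "l": "fr", "ps": 900, "p": "Bonjour …"}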

Usage:

> ll *.warc.gz
 Size Name
1.1Gi CC-MAIN-20221126080725-20221126110725-00000.warc.gz
1.1Gi WIDE-20171021194807-00260.warc.gz

> bin/warc2text --jsonl *.warc.gz | pigz -9c > metadata.jsonl.gz
[2023-02-17 14:41:02.338945] [info] Processing CC-MAIN-20221126080725-20221126110725-00000.warc.gz
[2023-02-17 14:42:02.002268] [info] Processing WIDE-20171021194807-00260.warc.gz
[2023-02-17 14:42:13.112524] [info] total records: 46660
[2023-02-17 14:42:13.112559] [info] text records: 44405
[2023-02-17 14:42:13.112567] [info] lang records: 40914
[2023-02-17 14:42:13.112574] [info] total bytes: 1456844861
[2023-02-17 14:42:13.112580] [info] text bytes: 328455828
[2023-02-17 14:42:13.112587] [info] lang bytes: 285338976
[2023-02-17 14:42:13.112593] [info] elapsed: 0h1m10s

> ll metadata.jsonl.gz
 Size Name
2.1Mi metadata.jsonl.gz

So about 2 GiB of WARC yields about 2 MiB of JSON lines.

Getting actual metadata from it:

> pigz -cd metadata.jsonl.gz | jq --raw-output .u | head
http://0337.maymay520.com/V4/?AID=164332&FID=1782326&WEBID=AVSHOW
http://064.ehiroba.jp/shopdetail/000000000660/ct91/page2/order/
http://095160170158.vectranet.pl/wiadomosci/item/12047-obchody-czerwca-76-z-rekomendacja-komisji-kultury
http://1118.cctv.com/2019/12/30/VIDErVoqOYK0GveK5J2BaDvq191230.shtml
http://114hzw.com/zhanhuipaiqi/industry/jiajujiaji/
http://120rcw.com/about/jinjia.html
http://123nu.dk/lystfiskeri/forum/registration_rules.asp?FID=0&SID=3cae74fcfz339af2f3f86321e46511e3
http://123stopfire.com/Fra/Fr_p1_01.html
http://1368.info/soi-cau-3-cang/
http://1801202223.djtom.cz/%D9%8A%D9%85%D9%83%D9%86-%D8%A3-%D9%8A%D9%83%D9%88%D9%86-%D9%86%D8%B8%D8%B1%D8%A7.html/
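Other quick analyses work the same way, e.g. a rough count of records per identified language (using the l field described above):

> pigz -cd metadata.jsonl.gz | jq --raw-output .l | sort | uniq -c | sort -rn | head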

@lpla lpla self-requested a review February 20, 2023 11:22
@lpla lpla self-assigned this Feb 20, 2023
WARCPreprocessor is already complicated enough as is. No need to pass in all those options just to construct a writer.
@jelmervdl commented Mar 14, 2023

There might be something wrong with the byte offsets. I've not yet been able to use something like tail -c +${OFFSET} | zless to jump to a particular record, and I've also not yet figured out why my offsets would be wrong.

Edit: I'm bad at reading. tail -c +N is 1-based, so I need to jump with tail -c +$((OFFSET + 1)) | zless and tada, it works.
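Putting that together, something like this (a bash sketch) should dump the record behind the first metadata line, assuming, as the tail trick above suggests, that the offsets point at gzip member boundaries in the compressed WARC:

read FILE OFFSET < <(pigz -cd metadata.jsonl.gz | head -n1 | jq -r '"\(.f) \(.o)"')
tail -c +$((OFFSET + 1)) "$FILE" | zless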

Also I'm not storing the compressed record size, which would be helpful when dd-ing parts of warcs to create a selection. Technically you can look at the offset of the next record, but we're also skipping records that aren't interesting, so those differences are not always just the size of one record.

Also also I would like to have some content hashing so we can detect (near?) duplicates from the metadata. Google used to use simhash back in the day to remove duplicate search results. Not sure whether they have a better method these days. Definitely anything multilingual would be too expensive to run anyway.

# Conflicts:
#	CMakeLists.txt
#	src/bilangwriter.hh
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
#	warc2text_main.cc
Which contains the `warc_path:offset:size` for each line.
@jelmervdl jelmervdl changed the title Add --jsonl option that prints only metadata Add --jsonl option Nov 2, 2023
@jelmervdl jelmervdl linked an issue Nov 2, 2023 that may be closed by this pull request
I know, more classes, but each one is significantly simpler 🎉
# Conflicts:
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
#	warc2text_main.cc
@jelmervdl jelmervdl mentioned this pull request Nov 9, 2023
@jelmervdl jelmervdl marked this pull request as ready for review November 9, 2023 12:40
# Conflicts:
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
@jelmervdl jelmervdl removed the request for review from lpla November 9, 2023 13:45
@nvanva commented Nov 24, 2023

Previously warc2text saved texts and URLs in parallel files text.gz and url.gz in directories named by language code. To save timestamps for documents we need to run warc2text with the --jsonl flag, which writes all texts and metadata to stdout instead. This breaks the current pipelines and requires modifying further steps (probably writing additional scripts that do exactly what warc2text does without --jsonl, i.e. duplicating logic already implemented in warc2text?). An alternative may be running warc2text twice, with and without the --jsonl flag, but that requires 2x the time and disk space.
At least for the purpose of filtering by robots.txt, would it be possible to have an option to just save the timestamps in a file parallel to text.gz and url.gz?

@jelmervdl

> To save timestamps for documents we need to run warc2text with the --jsonl flag.

You can use -f text,url,date to save the timestamps to a date.gz file.

This bit isn't entirely clear from the pull request, but in the updated readme it shows that I've also added options to produce all the new metadata in the old format.

Be sure to still run with ulimit -n unlimited, or something really high, when using the bitextor output format.
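For example, something along these lines (a sketch; the output directory and the exact ulimit value are placeholders):

ulimit -n 8192
bin/warc2text -f text,url,date -o out/ CC-MAIN-20221126080725-20221126110725-00000.warc.gz

This should write a date.gz next to text.gz and url.gz in each language directory.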

@jelmervdl

I'm having some reservations about whether the JSON output is valid UTF-8 (as I'm processing some of the output with Python and noticing issues). None of the code should produce invalid utf-8 as far as I can tell, but … keep an eye out for this when reviewing. I'll also look a bit more into that.
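A quick way to check for this while reviewing (just a sketch using standard tools; iconv stops with an error at the first invalid byte sequence):

pigz -cd metadata.jsonl.gz | iconv -f UTF-8 -t UTF-8 > /dev/null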

@akutuzov

Hi,

Will this PR be merged?

@ZJaume commented Dec 21, 2023

Trying this branch I've seen a noticeable regression in speed. If it is meant to be like this because of some feature, it is still a speed that we can afford, I guess. But I wanted to point this out in case there is something badly optimized.

Batch: WIDE-20121227150417-crawl413

WARC ID / Method                   150417  153314-12260  154314-12261  155838  161939  165509  171541  172947
warc2text master                      60s           48s           42s     30s     48s     34s     11s     31s
warc2text metadata-only              130s          105s           88s     65s     99s     74s     11s     66s
warc2text metadata-only --jsonl      121s           99s           80s     60s     93s     69s     11s     51s

The full command:

./warc2text_json/build/bin/warc2text \
--classifier fasttext --fasttext-model lid218e.bin \
--url-filters warc2text-runner/url-filter-list.optimised \
--tag-filters warc2text-runner/mt-filter-list.annotated \
--paragraph-identification -o text/ \
--silent --jsonl $i >/dev/null

@jelmervdl commented Dec 21, 2023

Did you compare fasttext to fasttext, and the non-jsonl command without --jsonl? --jsonl takes precedence over --output/-o.

I ran it locally, with a fresh checkout of master and this branch (with the last changes to master merged in, so same fastertext) and all speeds are pretty comparable for me:

branch: master, bitext
real	6m34.636s
user	6m30.540s
sys	0m3.178s

branch: metadata-only, bitext
real	6m37.463s
user	6m32.968s
sys	0m3.247s

branch: metadata-only, jsonl
real	6m11.547s
user	6m19.867s
sys	0m3.391s

Benchmark I ran (single run on my battery powered laptop, but laptop is not throttling or anything so I trust it):

#!/bin/bash
set -euo pipefail
ulimit -f unlimited

profile() {
  local prog=$1
  local output=$2
  shift 2
  $prog \
    --classifier fasttext \
    --fasttext-model lid201-model.ftz \
    --url-filters ../warc2text-runner/url-filter-list.optimised \
    --tag-filters ../warc2text-runner/mt-filter-list.annotated \
    --paragraph-identification \
    --output $output \
    --silent WIDE-20180405202949-00696.warc.gz \
    "$@"
}

echo "branch: master, bitext"
rm -rf out-main
time profile ../warc2text/build-master/bin/warc2text out-main/

echo "branch: metadata-only, bitext"
rm -rf out-json
time profile ../warc2text/build/bin/warc2text out-json/

echo "branch: metadata-only, jsonl"
time profile ../warc2text/build/bin/warc2text out-json --jsonl | gzip -9c > out-json.jsonl.gz

Edit: for fun, output sizes!

du -sh out-*
104M	out-main
104M	out-json
 52M	out-json.jsonl.gz

@jelmervdl commented Dec 21, 2023

However, this issue still exists:

gzip -cd out-json.jsonl.gz | python3 -c "import json,sys
for line in sys.stdin:
    json.loads(line)
"

Gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5022-5023: invalid continuation byte

Edit: this seems to be the case because the JSON output contains the entire payload under the p key, which isn't necessarily UTF-8, because the logic about how and when to convert which bit of data is pretty messy:

warc2text/src/record.cc, lines 222 to 243 at 8be9393:

// remove HTML tags:
if (isPlainText) {
    // convert to utf8 if needed (we do it before cleaning tabs, unlike HTML below):
    if (needToConvert)
        payload = util::toUTF8(payload, charset);
    util::trimLinesCopy(payload, extracted);
    std::replace_if(extracted.begin(), extracted.end(), [](wchar_t c){ return std::iscntrl(c) && c != '\n'; }, ' ');
} else {
    retval = processHTML(payload, extracted, tagFilters);
    // convert to utf8 if needed:
    if (needToConvert)
        extracted = util::toUTF8(extracted, charset);
}
// decode HTML entities:
if (isPlainText)
    plaintext = extracted;
else
    entities::decodeEntities(extracted, plaintext);

(Note that extracted is a temporary variable in this snippet; the payload is the p key in the JSON and plaintext is the t key.)

(Also mentioning #48 here, but that's not a solution since valid JSON always has to be valid UTF-8, and apparently the Boost library I'm using does not guarantee that either, i.e. it does not use escape sequences to encode the invalid byte sequences.)

@ZJaume commented Dec 22, 2023

The speed regression seems to be solved now. Syncing with master brought back fast execution times.

Regarding the invalid UTF-8, I think JSON does not change things much compared to what we had before. The error that you are getting with Python is not exactly a JSON parsing error. json.loads will always receive valid UTF-8 in that loop, because iterating over sys.stdin already performs an implicit .decode('utf8'). That exception is therefore thrown by sys.stdin, probably because your environment has errors="strict" as the default (?).

If I do:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    json.loads(i.strip())'

because in the env I use (idk if this is a default difference between Mac and Linux, or depends on other things), errors="surrogateescape" is the default. It gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 1870: invalid continuation byte

But without reconfigure (because surrogateescape is the default in my env), or explicitly using errors="replace" or errors="surrogateescape", I don't get any error.

Also, if I just read the input without any JSON parsing, or read from the base64 input, I get the same decoding errors:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'

zcat text/*/text.gz | base64 -d | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 4432: invalid continuation byte

So, in the end, I think it depends on what the downstream tool decides to do. For example, jq is able to parse all the JSONs I've generated because it replaces the invalid characters with a surrogate. The discussion about handling invalid UTF-8 characters can probably be moved somewhere other than this PR.

@ZJaume commented Jan 22, 2024

Although we may not use the output exactly as it is designed here, the PR seems stable enough and it doesn't interfere with the Bitextor format, so I'm merging it.

@ZJaume ZJaume merged commit 6a514b4 into bitextor:master Jan 22, 2024
Successfully merging this pull request may close these issues.

alternative output format based on JSONlines