Add --jsonl option #35

Merged (20 commits) on Jan 22, 2024
Conversation


@jelmervdl jelmervdl commented Feb 17, 2023

This is mostly to do some metadata analysis of the warcs, but could be a starting point for #34 as well.

For metadata I'm considering trying out writing to parquet directly. But since warc2text is run in parallel we'd still need to merge parquet files together before doing any analysis. So maybe jsonl is sufficient for this stage. And then we ingest all of those together into a massive parquet file for queries later.
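For reference, that later "ingest everything into one parquet file" step could look something like this (just a sketch, not part of this PR; assumes DuckDB is available and that all the per-shard JSONL files sit in the current directory):

duckdb -c "COPY (SELECT * FROM read_json_auto('*.jsonl.gz')) TO 'metadata.parquet' (FORMAT PARQUET)"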

Current output: each line contains a JSON object with the following keys (see the example after the list):

  • f: filename of warc file
  • o: byte offset of record in warc file
  • s: warc file record size
  • rs: byte size of record payload (uncompressed)
  • ps: byte size of the text-only payload (compare this against rs to get the amount of HTML removed)
  • l: language identified by the classifier
  • u: url
  • c: content type as reported by the HTTP response header (or warc record header if that isn't present)
  • p: plain text
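For illustration, a single line might look like this (values are made up, and p is truncated here):

{"f": "CC-MAIN-20221126080725-20221126110725-00000.warc.gz", "o": 163840, "s": 5120, "rs": 40960, "ps": 2048, "l": "en", "u": "http://example.com/", "c": "text/html", "p": "Example Domain\nThis domain is for use in illustrative examples…"}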

Todo:

  • ts: crawl date as found in the record header (no date normalisation or anything)
  • pt: for each paragraph/line in p, the most deeply nested tag it was found in.
    Should this be an array of strings, or a single string separated by newlines to match p? (See the illustration after this list.)
  • pi: paragraph identifiers as normally produced by get_paragraph_id().
    Same question as for pt; or we could just keep that function as-is and add the paragraph identifiers inside p, which is a real mess but might be easiest for compatibility.

    Moving these things to Track html tags #46.
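To illustrate the two representations being weighed for pt (purely hypothetical values):

{"p": "Heading\nFirst paragraph", "pt": ["h1", "p"]}
{"p": "Heading\nFirst paragraph", "pt": "h1\np"}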

I also want to make these new columns available to the original bitext output as possible arguments for -f.

--multilang is also supported for the CLD2 classifier. In that case you'd get multiple JSON lines per record, one for each identified language. The attributes that relate to the record itself are duplicated; only p, ps and l differ.
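For example, a record containing both English and French text would produce two lines like these (made-up values; other fields omitted for brevity):

{"f": "WIDE-20171021194807-00260.warc.gz", "o": 98304, "u": "http://example.com/", "l": "en", "ps": 1500, "p": "Hello …"}
{"f": "WIDE-20171021194807-00260.warc.gz", "o": 98304, "u": "http://example.com/", "l": "fr", "ps": 900, "p": "Bonjour …"}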

Usage:

> ll *.warc.gz
 Size Name
1.1Gi CC-MAIN-20221126080725-20221126110725-00000.warc.gz
1.1Gi WIDE-20171021194807-00260.warc.gz

> bin/warc2text --jsonl *.warc.gz | pigz -9c > metadata.jsonl.gz
[2023-02-17 14:41:02.338945] [info] Processing CC-MAIN-20221126080725-20221126110725-00000.warc.gz
[2023-02-17 14:42:02.002268] [info] Processing WIDE-20171021194807-00260.warc.gz
[2023-02-17 14:42:13.112524] [info] total records: 46660
[2023-02-17 14:42:13.112559] [info] text records: 44405
[2023-02-17 14:42:13.112567] [info] lang records: 40914
[2023-02-17 14:42:13.112574] [info] total bytes: 1456844861
[2023-02-17 14:42:13.112580] [info] text bytes: 328455828
[2023-02-17 14:42:13.112587] [info] lang bytes: 285338976
[2023-02-17 14:42:13.112593] [info] elapsed: 0h1m10s

> ll metadata.jsonl.gz
 Size Name
2.1Mi metadata.jsonl.gz

So about 2 GiB of WARC yields about 2 MiB of JSON lines.

Getting actual metadata from it:

> pigz -cd metadata.jsonl.gz | jq --raw-output .u | head
http://0337.maymay520.com/V4/?AID=164332&FID=1782326&WEBID=AVSHOW
http://064.ehiroba.jp/shopdetail/000000000660/ct91/page2/order/
http://095160170158.vectranet.pl/wiadomosci/item/12047-obchody-czerwca-76-z-rekomendacja-komisji-kultury
http://1118.cctv.com/2019/12/30/VIDErVoqOYK0GveK5J2BaDvq191230.shtml
http://114hzw.com/zhanhuipaiqi/industry/jiajujiaji/
http://120rcw.com/about/jinjia.html
http://123nu.dk/lystfiskeri/forum/registration_rules.asp?FID=0&SID=3cae74fcfz339af2f3f86321e46511e3
http://123stopfire.com/Fra/Fr_p1_01.html
http://1368.info/soi-cau-3-cang/
http://1801202223.djtom.cz/%D9%8A%D9%85%D9%83%D9%86-%D8%A3-%D9%8A%D9%83%D9%88%D9%86-%D9%86%D8%B8%D8%B1%D8%A7.html/
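Other quick analyses work the same way, e.g. a rough count of records per identified language (using the l field described above):

> pigz -cd metadata.jsonl.gz | jq --raw-output .l | sort | uniq -c | sort -rn | head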

@lpla lpla self-requested a review February 20, 2023 11:22
@lpla lpla self-assigned this Feb 20, 2023
WARCPreprocessor is already complicated enough as is. No need to pass in all those options just to construct a writer.
@jelmervdl commented Mar 14, 2023

There might be something wrong with the byte offsets. I've not yet been able to use something like tail -c +${OFFSET} | zless to jump to a particular record, and I've also not yet figured out why my offsets would be wrong.

Edit: I'm bad at reading. tail -c +N is 1-based, so I need to jump with tail -c +$((OFFSET + 1)) | zless and tada, it works.
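Putting that together, something like this (a bash sketch) should dump the record behind the first metadata line, assuming, as the tail trick above suggests, that the offsets point at gzip member boundaries in the compressed WARC:

read FILE OFFSET < <(pigz -cd metadata.jsonl.gz | head -n1 | jq -r '"\(.f) \(.o)"')
tail -c +$((OFFSET + 1)) "$FILE" | zless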

Also I'm not storing the compressed record size, which would be helpful when dd-ing parts of warcs to create a selection. Technically you can look at the offset of the next record, but we're also skipping records that aren't interesting, so those differences are not always just the size of one record.

Also also I would like to have some content hashing so we can detect (near?) duplicates from the metadata. Google used to use simhash back in the day to remove duplicate search results. Not sure whether they have a better method these days. Definitely anything multilingual would be too expensive to run anyway.

# Conflicts:
#	CMakeLists.txt
#	src/bilangwriter.hh
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
#	warc2text_main.cc
Which contains the `warc_path:offset:size` for each line.
@jelmervdl jelmervdl changed the title Add --jsonl option that prints only metadata Add --jsonl option Nov 2, 2023
@jelmervdl jelmervdl linked an issue Nov 2, 2023 that may be closed by this pull request
I know, more classes, but each one is significantly simpler 🎉
# Conflicts:
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
#	warc2text_main.cc
@jelmervdl jelmervdl mentioned this pull request Nov 9, 2023
@jelmervdl jelmervdl marked this pull request as ready for review November 9, 2023 12:40
# Conflicts:
#	src/warcpreprocessor.cc
#	src/warcpreprocessor.hh
@jelmervdl jelmervdl removed the request for review from lpla November 9, 2023 13:45
@nvanva commented Nov 24, 2023

Previously warc2text saved texts and URLs in parallel files text.gz and url.gz in directories named by language code. To save timestamps for documents we need to run warc2text with the --jsonl flag, which writes all texts and metadata to stdout instead. This breaks the current pipelines and requires modifying further steps (probably writing additional scripts that do exactly what warc2text does without --jsonl, i.e. duplicating logic already implemented in warc2text?). An alternative may be running warc2text twice, with and without the --jsonl flag, but that requires 2x the time and disk space.
At least for the purpose of filtering by robots.txt, would it be possible to have an option to just save the timestamps in a file parallel to text.gz and url.gz?

@jelmervdl

> To save timestamps for documents we need to run warc2text with the --jsonl flag.

You can use -f text,url,date to save the timestamps to a date.gz file.

This bit isn't entirely clear from the pull request, but in the updated readme it shows that I've also added options to produce all the new metadata in the old format.

Be sure to still run with ulimit -n unlimited, or something really high, when using the bitextor output format.
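For example, something along these lines (a sketch; the output directory and the exact ulimit value are placeholders):

ulimit -n 8192
bin/warc2text -f text,url,date -o out/ CC-MAIN-20221126080725-20221126110725-00000.warc.gz

This should write a date.gz next to text.gz and url.gz in each language directory.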

@jelmervdl

I'm having some reservations about whether the JSON output is valid UTF-8 (as I'm processing some of the output with Python and noticing issues). None of the code should produce invalid utf-8 as far as I can tell, but … keep an eye out for this when reviewing. I'll also look a bit more into that.
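A quick way to check for this while reviewing (just a sketch using standard tools; iconv stops with an error at the first invalid byte sequence):

pigz -cd metadata.jsonl.gz | iconv -f UTF-8 -t UTF-8 > /dev/null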

@akutuzov

Hi,

Will this PR be merged?

@ZJaume commented Dec 21, 2023

Trying this branch I've seen a noticeable regression in speed. If it is meant to be like this because of some feature, it is still a speed that we can afford, I guess. But I wanted to point this out in case there is something badly optimized.

Batch: WIDE-20121227150417-crawl413

WARC ID / Method                   150417  153314-12260  154314-12261  155838  161939  165509  171541  172947
warc2text master                      60s           48s           42s     30s     48s     34s     11s     31s
warc2text metadata-only              130s          105s           88s     65s     99s     74s     11s     66s
warc2text metadata-only --jsonl      121s           99s           80s     60s     93s     69s     11s     51s

The full command:

./warc2text_json/build/bin/warc2text \
--classifier fasttext --fasttext-model lid218e.bin \
--url-filters warc2text-runner/url-filter-list.optimised \
--tag-filters warc2text-runner/mt-filter-list.annotated \
--paragraph-identification -o text/ \
--silent --jsonl $i >/dev/null

@jelmervdl commented Dec 21, 2023

Did you compare fasttext to fasttext, and the non-jsonl command without --jsonl? --jsonl takes precedence over --output/-o.

I ran it locally, with a fresh checkout of master and this branch (with the last changes to master merged in, so same fastertext) and all speeds are pretty comparable for me:

branch: master, bitext
real	6m34.636s
user	6m30.540s
sys	0m3.178s

branch: metadata-only, bitext
real	6m37.463s
user	6m32.968s
sys	0m3.247s

branch: metadata-only, jsonl
real	6m11.547s
user	6m19.867s
sys	0m3.391s

Benchmark I ran (single run on my battery powered laptop, but laptop is not throttling or anything so I trust it):

#!/bin/bash
set -euo pipefail
ulimit -f unlimited

profile() {
  local prog=$1
  local output=$2
  shift 2
  $prog \
    --classifier fasttext \
    --fasttext-model lid201-model.ftz \
    --url-filters ../warc2text-runner/url-filter-list.optimised \
    --tag-filters ../warc2text-runner/mt-filter-list.annotated \
    --paragraph-identification \
    --output $output \
    --silent WIDE-20180405202949-00696.warc.gz \
    "$@"
}

echo "branch: master, bitext"
rm -rf out-main
time profile ../warc2text/build-master/bin/warc2text out-main/

echo "branch: metadata-only, bitext"
rm -rf out-json
time profile ../warc2text/build/bin/warc2text out-json/

echo "branch: metadata-only, jsonl"
time profile ../warc2text/build/bin/warc2text out-json --jsonl | gzip -9c > out-json.jsonl.gz

Edit: for fun, output sizes!

du -sh out-*
104M	out-main
104M	out-json
 52M	out-json.jsonl.gz

@jelmervdl commented Dec 21, 2023

However, this issue still exists:

gzip -cd out-json.jsonl.gz | python3 -c "import json,sys
for line in sys.stdin:
    json.loads(line)
"

Gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 5022-5023: invalid continuation byte

Edit: this seems to be the case because the JSON output contains the entire payload under the p key, which isn't necessarily UTF-8, because the logic about how and when to convert which bit of data is pretty messy:

warc2text/src/record.cc, lines 222 to 243 at 8be9393:

// remove HTML tags:
if (isPlainText) {
    // convert to utf8 if needed (we do it before cleaning tabs, unlike HTML below):
    if (needToConvert)
        payload = util::toUTF8(payload, charset);
    util::trimLinesCopy(payload, extracted);
    std::replace_if(extracted.begin(), extracted.end(), [](wchar_t c){ return std::iscntrl(c) && c != '\n'; }, ' ');
} else {
    retval = processHTML(payload, extracted, tagFilters);
    // convert to utf8 if needed:
    if (needToConvert)
        extracted = util::toUTF8(extracted, charset);
}
// decode HTML entities:
if (isPlainText)
    plaintext = extracted;
else
    entities::decodeEntities(extracted, plaintext);

(Note that extracted is a temporary variable in this snippet; the payload is the p key in the JSON and plaintext is the t key.)

(Also mentioning #48 here, but that's not a solution since valid JSON always has to be valid UTF-8, and apparently the Boost library I'm using does not guarantee that either, i.e. it does not use escape sequences to encode the invalid byte sequences.)

@ZJaume commented Dec 22, 2023

The speed regression seems to be solved now. Syncing with master brought back fast execution times.

Regarding the invalid UTF-8, I think JSON does not change things much compared to what we had before. The error that you are getting with Python is not exactly a JSON parsing error. json.loads will always receive valid UTF-8 in that loop, because iterating over sys.stdin already performs an implicit .decode('utf8'). That exception is therefore thrown by sys.stdin, probably because your environment has errors="strict" as the default (?).

If I do:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    json.loads(i.strip())'

because in the env I use (idk if this is a default difference between Mac and Linux, or depends on other things), errors="surrogateescape" is the default. It gives:

Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 1870: invalid continuation byte

But without reconfigure (because surrogateescape is the default in my env), or explicitly using errors="replace" or errors="surrogateescape", I don't get any error.

Also, if I just read the input without any JSON parsing, or read from the base64 input, I get the same decoding errors:

zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'

zcat text/*/text.gz | base64 -d | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    continue'
Traceback (most recent call last):
  File "<string>", line 2, in <module>
  File "/usr/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 4432: invalid continuation byte

So, in the end, I think it depends on what the downstream tool decides to do. For example, jq is able to parse all the JSONs I've generated because it replaces the invalid characters with a surrogate. The discussion about handling invalid UTF-8 characters can probably be moved somewhere other than this PR.

@ZJaume commented Jan 22, 2024

Although we may not use the output exactly as it is designed here, the PR seems stable enough and it doesn't interfere with the Bitextor format, so I'm merging it.

@ZJaume ZJaume merged commit 6a514b4 into bitextor:master Jan 22, 2024
Successfully merging this pull request may close these issues.

alternative output format based on JSONlines