Add --jsonl option #35

Conversation
WARCPreprocessor is already complicated enough as is. No need to pass in all those options just to construct a writer.
There might be something wrong with the byte offsets. I've not yet been able to use something like …

Edit: I'm bad at reading. It starts showing at $OFFSET, so I need to jump to …

Also, I'm not storing the compressed record size, which would be helpful when …

Also also, I would like to have some content hashing so we can detect (near?) duplicates from the metadata. Google used to use simhash back in the day to remove duplicate search results; not sure whether they have a better method these days. Definitely anything multilingual would be too expensive to run anyway.
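For context, a minimal sketch (not part of this PR) of what a simhash-style fingerprint over the extracted text could look like, with near-duplicates detected via a small Hamming distance:

```python
import hashlib

def simhash(tokens, bits=64):
    # Classic simhash: each token votes on every bit of the fingerprint,
    # weighted by the corresponding bit of its own hash.
    votes = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    # Near-duplicate documents have fingerprints with a small Hamming distance.
    return bin(a ^ b).count("1")
```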
# Conflicts: CMakeLists.txt, src/bilangwriter.hh, src/warcpreprocessor.cc, src/warcpreprocessor.hh, warc2text_main.cc
Which contains the `warc_path:offset:size` for each line.
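As an illustration (not from the PR), one way such a pointer could be used, assuming the offset and size refer to the record's own gzip member inside the `.warc.gz` (WARC files are typically compressed one record per member):

```python
import gzip

def read_record(warc_path: str, offset: int, size: int) -> bytes:
    # Seek to the stored offset, read `size` compressed bytes, and decompress
    # just that gzip member to get the raw WARC record (headers + payload).
    with open(warc_path, "rb") as f:
        f.seek(offset)
        member = f.read(size)
    return gzip.decompress(member)
```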
I know, more classes, but each one is significantly simpler 🎉
# Conflicts: src/warcpreprocessor.cc, src/warcpreprocessor.hh, warc2text_main.cc
# Conflicts: src/warcpreprocessor.cc, src/warcpreprocessor.hh
Previously warc2text saved texts and urls in parallel files text.gz and url.gz, in directories named after language codes. To save timestamps for documents we need to run warc2text with the --jsonl flag. This results in all texts and meta information being written to stdout, which breaks the current pipelines and requires modifying further steps (probably writing additional scripts that do exactly what warc2text does without --jsonl, i.e. duplicating logic already implemented in warc2text?). An alternative may be running warc2text twice, with and without the --jsonl flag, but this requires 2x more time and disk space.
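For illustration only (not part of the PR, and assuming the old layout stores one base64-encoded document per line in `text.gz`, parallel to `url.gz`), a rough sketch of splitting the JSONL output back into per-language files:

```python
import base64
import gzip
import json
import os
import sys

# Hypothetical converter: read warc2text --jsonl output from stdin and rebuild
# an old-style <lang>/text.gz + <lang>/url.gz layout.
writers = {}

def writer(lang, name):
    os.makedirs(lang, exist_ok=True)
    key = (lang, name)
    if key not in writers:
        writers[key] = gzip.open(os.path.join(lang, name), "at", encoding="utf-8")
    return writers[key]

for line in sys.stdin:
    rec = json.loads(line)
    doc = base64.b64encode(rec["p"].encode("utf-8")).decode("ascii")
    writer(rec["l"], "text.gz").write(doc + "\n")
    writer(rec["l"], "url.gz").write(rec["u"] + "\n")

for f in writers.values():
    f.close()
```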
You can use … This bit isn't entirely clear from the pull request, but the updated readme shows that I've also added options to produce all the new metadata in the old format. Be sure to run with a …
I'm having some reservations about whether the JSON output is valid UTF-8 (I'm processing some of the output with Python and noticing issues). None of the code should produce invalid UTF-8 as far as I can tell, but … keep an eye out for this when reviewing. I'll also look a bit more into that.
Hi, will this PR be merged?
Trying this branch I've seen a remarkable regression in speed. If it is meant to be like this because of some feature, it is still a speed that we can afford, I guess. But I wanted to point this out in case there is something badly optimized. Batch:
The full command:
Did you compare fasttext to fasttext, and the non-jsonl command without …?

I ran it locally, with a fresh checkout of master and this branch (with the last changes to master merged in, so same fastertext) and all speeds are pretty comparable for me:
Benchmark I ran (single run on my battery-powered laptop, but the laptop is not throttling or anything, so I trust it):

```bash
#!/bin/bash
set -euo pipefail

ulimit -f unlimited

profile() {
    local prog=$1
    local output=$2
    shift 2
    $prog \
        --classifier fasttext \
        --fasttext-model lid201-model.ftz \
        --url-filters ../warc2text-runner/url-filter-list.optimised \
        --tag-filters ../warc2text-runner/mt-filter-list.annotated \
        --paragraph-identification \
        --output $output \
        --silent WIDE-20180405202949-00696.warc.gz \
        "$@"
}

echo "branch: master, bitext"
rm -rf out-main
time profile ../warc2text/build-master/bin/warc2text out-main/

echo "branch: metadata-only, bitext"
rm -rf out-json
time profile ../warc2text/build/bin/warc2text out-json/

echo "branch: metadata-only, jsonl"
time profile ../warc2text/build/bin/warc2text out-json --jsonl | gzip -9c > out-json.jsonl.gz
```

Edit: for fun, output sizes!
However, this issue still exists:

```bash
gzip -cd out-json.jsonl.gz | python3 -c "import json,sys
for line in sys.stdin:
    json.loads(line)
"
```

Gives:
Edit: this seems to be the case because the JSON output contains the entire payload under the … key (see lines 222 to 243 in 8be9393):
(Note that `extracted` is a temp var in that snippet; the payload is the `p` key in the JSON and the plain text is the `t` key.)
(Also mentioning #48 here, but that's not a solution, since valid JSON always has to be valid UTF-8 and apparently the Boost library I'm using does not guarantee that bit, i.e. it uses escape sequences to encode the invalid byte sequence.)
The speed regression seems to be solved now. Syncing with master brought back fast execution times.

Regarding the invalid UTF-8, I think JSON does not change things much compared to what we had before. The error that you are getting with Python is not exactly a JSON parsing error. The …

If I do:

```bash
zcat text.jsonl.gz | python -c 'import sys; import json; sys.stdin.reconfigure(errors="strict")
for i in sys.stdin:
    json.loads(i.strip())'
```

… because in the env I use (idk if this is a default difference between Mac and Linux, or depends on other things), …

But without reconfigure (because escape is the default in my env) or explicitly using …

Also, if I just read the input without any JSON parsing, or read from the base64 input, I get the same decoding errors: …

So, in the end, I think it depends on what the downstream tool decides to do. For example …
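As an aside, a small sketch (assuming the `out-json.jsonl.gz` file from the benchmark above) that reports which lines are not valid UTF-8, independent of any JSON parsing:

```python
import gzip

# Scan the gzipped JSONL output and report lines whose raw bytes fail strict
# UTF-8 decoding, without involving the JSON parser at all.
with gzip.open("out-json.jsonl.gz", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")
```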
Although we may not use the output as it is designed here, the PR seems stable enough and it doesn't interfere with the Bitextor format, so I'm merging it.
This is mostly to do some metadata analysis of the warcs, but could be a starting point for #34 as well.
For metadata I'm considering trying out writing to parquet directly. But since warc2text is run in parallel we'd still need to merge parquet files together before doing any analysis. So maybe jsonl is sufficient for this stage. And then we ingest all of those together into a massive parquet file for queries later.
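A minimal sketch of that ingest step (hypothetical paths, assuming one `.jsonl` file per warc2text run and consistent fields across files), using pyarrow's newline-delimited JSON reader:

```python
import glob

import pyarrow as pa
import pyarrow.json as pj
import pyarrow.parquet as pq

# Read each run's newline-delimited JSON output and concatenate into a single
# table, then write one parquet file that can be queried later.
tables = [pj.read_json(path) for path in sorted(glob.glob("out/*.jsonl"))]
pq.write_table(pa.concat_tables(tables), "metadata.parquet")
```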
Current output: each line contains a JSON object that consists of:

- `f`: filename of warc file
- `o`: byte offset of record in warc file
- `s`: warc file record size
- `rs`: byte size of record payload (uncompressed)
- `ps`: byte size of text-only payload (so compare this against `rs` and you should get the amount of HTML removed)
- `l`: identified language by classifier
- `u`: url
- `c`: content type as reported by the HTTP response header (or warc record header if that isn't present)
- `p`: plain text

Todo:

- `ts`: crawl date as found in the record header (no date normalisation or anything)
- `pt`: per paragraph/line in `p`, the most nested tag it was found in. Should this be an array of strings? Or a string separated by newlines to match `p`?
- `pi`: paragraph identifiers as normally produced by `get_paragraph_id()`. Same question as for `pt`, or even just keep this function as-is and add the paragraph identifiers inside `p`, which is a real mess but might be easiest for compatibility?

Moving these things to Track html tags #46.

I also want to make these new columns available to the original bitext output as possible arguments for `-f`.

`--multilang` is also supported for the CLD2 classifier. In that case you'd get multiple json lines per record, one for each identified language. The attributes that relate to the record itself will be duplicated; only `p`, `ps` and `l` differ.

Usage:

So 2Gb of warc yields about 2Mb of jsonlines.

Getting actual metadata from it:
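The exact command isn't preserved above, but a hypothetical equivalent (dropping the plain-text `p` field so that only the metadata remains) could look like:

```python
import json
import sys

# Read JSONL from stdin, drop the plain-text payload, and emit metadata-only JSONL.
for line in sys.stdin:
    record = json.loads(line)
    record.pop("p", None)  # keep only the metadata fields
    print(json.dumps(record, ensure_ascii=False))
```

e.g. `gzip -cd out-json.jsonl.gz | python3 strip_payload.py > metadata.jsonl` (file names are placeholders).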