Add --jsonl option #35

Merged
merged 20 commits into from
Jan 22, 2024
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -16,7 +16,7 @@ if (NOT CMAKE_BUILD_TYPE)
set(CMAKE_BUILD_TYPE Release)
endif ()

find_package(Boost 1.71 COMPONENTS program_options log log_setup REQUIRED)
find_package(Boost 1.75 COMPONENTS program_options json log log_setup REQUIRED)

# compile executable into bin/
set(EXECUTABLE_OUTPUT_PATH ${PROJECT_BINARY_DIR}/bin)
43 changes: 40 additions & 3 deletions README.md
@@ -42,11 +42,15 @@ warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]
[ --paragraph-identification ] [ --tag-filters <filters_file> ] <warc_file>...
```
* `--output`/`-o` output folder
* `--files`/`-f` list of output files separated by commas (and without `.gz`); `text` and `url` are always written, while `mime` and `html` are optional
* `--files`/`-f` list of output files separated by commas (and without `.gz`). Options are `text`, `html`, `url`, `mime`, `file` and `date`. Defaults to `text,url`. See [output](#output).
* `--jsonl` Produce JSON Lines on stdout instead of writing to files per language.
* `--pdfpass` WARC file where PDF records will be stored
* `--robotstxtpass` WARC file where robots.txt related records will be stored
* `--encode-urls` Escape non-ascii characters that appear in the record URL with `%dd` encoding.
* `--multilang` Detect multiple languages in the document, and split the document accordingly. Only supported with CLD2 classifier.
* `--paragraph-identification` print the paragraph identifier for each sentence extracted from the HTML
* `--classifier` classifier to use: `cld2` or `fasttext`.
* `--fasttext-model` path to FastText model for fasttext classifier.
* `--classifier` classifier to use: `cld2` or `fasttext`. When `fasttext` is used, one also has to specify a model using `--fasttext-model`.
* `--fasttext-model` path to FastText model for fasttext classifier. Models can be any [FastText language identification model](https://fasttext.cc/docs/en/language-identification.html) such as [OpenLID lid201-model.ftz](https://github.com/laurieburchell/open-lid-dataset#quantised-model).
* `--tag-filters` file containing filters that are used to eliminate matching documents
* `--invert-tag-filters` output only documents that match the filter
* `--url-filters` file containing regular expressions that match urls of documents to eliminate
@@ -61,6 +65,39 @@ warc2text -o <output_folder> [ -f <output_files> ] [ --pdfpass <output_warc> ]

Lines beginning with `#` and empty lines are ignored. Any invalid filter will raise a warning message, but will not prevent other filters from being read.

## Output
When used with `--output`/`-o` (optionally with `--files`/`-f`), warc2text will
produce the following directory structure at the path specified by `--output`:

- `./{lang}/text.gz` will contain the plain text per document as base64 encoded lines. E.g. `gzip -cd en/text.gz | head -n5 | tail -n1 | base64 -d` will give you the 5th document's text.
- `./{lang}/url.gz` contains [the crawled URL](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-target-uri) for each record.
- `./{lang}/mime.gz` contains the mimetype as reported by the crawled server
- `./{lang}/html.gz` contains lines of base64 encoded HTML as returned by the server. For ePub, MS Office or ODF files this is the extracted XML.
- `./{lang}/file.gz` contains the `{filename}:{offset}:{length}` pointer to the warc archive the record was extracted from. `{offset}` and `{length}` are of the compressed data, e.g. `tail -c+{offset} < {filename} | head -c{length} | gzip -cd` will give you the original record.
- `./{lang}/date.gz` gives you the original crawl date/time as reported by the crawler. [This should be a UTC timestamp](https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-date-mandatory).

In every file, each line corresponds to the same record. E.g. the fifth line in `text.gz` and fifth line in `url.gz` together give you the text and url for a single record.
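
A line from `file.gz` can be split back into its three parts before seeking into the archive. The following is a minimal sketch, not code from warc2text: `WarcPointer`, `parse_u64` and `parse_pointer` are hypothetical names. It assumes the filename itself may contain colons, so it splits at the last two colons rather than the first two.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <string>

// Hypothetical helper type for illustration; not part of warc2text.
struct WarcPointer {
    std::string filename;
    std::uint64_t offset;
    std::uint64_t length;
};

// Parse a non-empty run of decimal digits; rejects signs, spaces and overflow.
static bool parse_u64(const std::string& s, std::uint64_t& out) {
    if (s.empty()) return false;
    const std::uint64_t max = std::numeric_limits<std::uint64_t>::max();
    std::uint64_t value = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return false;
        const std::uint64_t digit = static_cast<std::uint64_t>(c - '0');
        if (value > (max - digit) / 10) return false;  // would overflow
        value = value * 10 + digit;
    }
    out = value;
    return true;
}

// Parse "{filename}:{offset}:{length}". The split happens at the last two
// colons, so a colon inside the filename does not confuse the parser.
std::optional<WarcPointer> parse_pointer(const std::string& line) {
    const std::size_t last = line.rfind(':');
    if (last == std::string::npos || last == 0) return std::nullopt;
    const std::size_t prev = line.rfind(':', last - 1);
    if (prev == std::string::npos || prev == 0) return std::nullopt;

    WarcPointer result;
    result.filename = line.substr(0, prev);
    if (!parse_u64(line.substr(prev + 1, last - prev - 1), result.offset))
        return std::nullopt;
    if (!parse_u64(line.substr(last + 1), result.length))
        return std::nullopt;
    return result;
}
```

Returning `std::optional` keeps malformed lines from aborting a long scan; the caller can skip or log them as it sees fit.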

The `{lang}` part of the path is determined by the classifier (see `--classifier`) and may be a two-letter or three-letter code depending on the classifier used. See [this list](https://github.com/CLD2Owners/cld2/blob/b56fa78a2fe44ac2851bae5bf4f4693a0644da7b/internal/generated_language.cc#L647-L1262) for CLD2.

When using `--jsonl`, the output is instead a single JSON record per line, with the following keys (always in this order):
```ts
{
f: string, // filename of the WARC file (same as the `{filename}` part in `file.gz`)
o: number, // byte offset of the record in the WARC file (same as `{offset}` in `file.gz`)
s: number, // size of the record in the WARC file (same as `{length}` in `file.gz`)
rs: number, // byte size of the record payload (uncompressed)
ps: number, // byte size of the text-only payload (compare against `rs` to see how much HTML was removed)
l: string, // language identified by the classifier
u: string, // URL
c: string, // content type as reported by the HTTP response header (or the WARC record header if that isn't present)
ts: string, // crawl date/time as reported by the crawler
p: string, // plain text
}
```

More keys might be added in the future (e.g. the raw HTML is not included now) and you should not expect the order of the keys to stay the same between different versions of warc2text.
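
The one-record-per-line guarantee depends on newlines inside the plain text being escaped. warc2text relies on Boost.JSON for serialization; the sketch below is only a standard-library stand-in to show the idea. `JsonlRecord`, `escape_json` and `to_json_line` are hypothetical names, and the exact bytes Boost.JSON emits may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical stand-in holding the documented keys; not a warc2text type.
struct JsonlRecord {
    std::string f;
    std::size_t o;
    std::size_t s;
    std::size_t rs;
    std::size_t ps;
    std::string l;
    std::string u;
    std::string c;
    std::string ts;
    std::string p;
};

// Escape a string for use inside a JSON string literal. Quotes, backslashes
// and control characters are escaped; all other bytes pass through unchanged.
std::string escape_json(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (unsigned char ch : in) {
        switch (ch) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (ch < 0x20) {
                    char buf[7];  // "\uXXXX" plus terminating NUL
                    std::snprintf(buf, sizeof buf, "\\u%04x",
                                  static_cast<unsigned>(ch));
                    out += buf;
                } else {
                    out += static_cast<char>(ch);
                }
        }
    }
    return out;
}

// Serialize one record as a single line, keys in the documented order.
std::string to_json_line(const JsonlRecord& r) {
    std::string line = "{";
    line += "\"f\":\"" + escape_json(r.f) + "\",";
    line += "\"o\":" + std::to_string(r.o) + ",";
    line += "\"s\":" + std::to_string(r.s) + ",";
    line += "\"rs\":" + std::to_string(r.rs) + ",";
    line += "\"ps\":" + std::to_string(r.ps) + ",";
    line += "\"l\":\"" + escape_json(r.l) + "\",";
    line += "\"u\":\"" + escape_json(r.u) + "\",";
    line += "\"c\":\"" + escape_json(r.c) + "\",";
    line += "\"ts\":\"" + escape_json(r.ts) + "\",";
    line += "\"p\":\"" + escape_json(r.p) + "\"}\n";
    return line;
}
```

Because every raw newline in `p` becomes the two characters `\n`, the only newline left in the output is the record terminator, which is what lets line-oriented tools process the stream.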

## Included dependencies
HTML Tokenizer by [c-smile](https://www.codeproject.com/Articles/14076/Fast-and-Compact-HTML-XML-Scanner-Tokenizer)

131 changes: 79 additions & 52 deletions src/bilangwriter.cc
@@ -3,30 +3,26 @@
#include "util/exception.hh"
#include <cassert>
#include <string>
#include <iomanip>
#include <boost/json.hpp>


namespace warc2text{

GzipWriter::GzipWriter() {
dest = nullptr;
compressed = 0;
s.zalloc = nullptr;
s.zfree = nullptr;
s.opaque = nullptr;
int ret = deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8, Z_DEFAULT_STRATEGY);
assert(ret == Z_OK);
buf = new unsigned char[BUFFER_SIZE];
GzipWriter::GzipWriter()
: dest(nullptr),
buf(new unsigned char[BUFFER_SIZE]) {
//
}

GzipWriter::~GzipWriter() {
if (dest) {
this->compress("", 0, Z_FINISH);
deflateEnd(&s);
std::fclose(dest);
}
if (is_open())
close();
delete[] buf;
}

void GzipWriter::compress(const char *in, std::size_t size, int flush) {
assert(is_open());
if (size == 0 && flush == Z_NO_FLUSH) return;
s.avail_in = size;
s.next_in = (Bytef *) in;
@@ -39,7 +35,7 @@ namespace warc2text{
s.next_out = buf;
ret = deflate(&s, flush);
assert(ret == Z_OK || ret == Z_STREAM_END); // Z_STREAM_END only happens if flush == Z_FINISH
compressed = BUFFER_SIZE - s.avail_out;
std::size_t compressed = BUFFER_SIZE - s.avail_out;
//written = std::fwrite(buf, 1, compressed, dest);
std::fwrite(buf, 1, compressed, dest);
// TODO error handling
@@ -52,47 +48,68 @@ namespace warc2text{
void GzipWriter::open(const std::string& filename) {
dest = std::fopen(filename.c_str(), "wb");
UTIL_THROW_IF(!dest, util::ErrnoException, "while creating " << filename);
s.zalloc = nullptr;
s.zfree = nullptr;
s.opaque = nullptr;
int ret = deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 31, 8, Z_DEFAULT_STRATEGY);
assert(ret == Z_OK);
}

void GzipWriter::close() {
compress("", 0, Z_FINISH);
deflateEnd(&s);
std::fclose(dest);
dest = nullptr;
}

void GzipWriter::write(const char* text, std::size_t size) {
this->compress(text, size, Z_NO_FLUSH);
compress(text, size, Z_NO_FLUSH);
}

void GzipWriter::writeLine(const char* text, std::size_t size) {
this->compress(text, size, Z_NO_FLUSH);
this->compress("\n", 1, Z_NO_FLUSH);
compress(text, size, Z_NO_FLUSH);
compress("\n", 1, Z_NO_FLUSH);
}

void GzipWriter::writeLine(const std::string& text) {
this->compress(text.c_str(), text.size(), Z_NO_FLUSH);
this->compress("\n", 1, Z_NO_FLUSH);
compress(text.c_str(), text.size(), Z_NO_FLUSH);
compress("\n", 1, Z_NO_FLUSH);
}

bool GzipWriter::is_open(){
return dest != nullptr;
}

void BilangWriter::write(const std::string& lang, const std::string& b64text, const std::string& url, const std::string& mime, const std::string& b64html) {
GzipWriter* gzurl = &url_files[lang];
GzipWriter* gztext = &text_files[lang];
GzipWriter* gzmime = nullptr;
GzipWriter* gzhtml = nullptr;
if (output_files.count("mime") == 1) gzmime = &(mime_files[lang]);
if (output_files.count("html") == 1) gzhtml = &(html_files[lang]);
if (!gzurl->is_open()) {
// if one file does not exist, the rest shouldn't either
std::string path = folder + "/" + lang;
util::createDirectories(path);
gzurl->open(path + "/url.gz");
gztext->open(path + "/text.gz");
if (gzmime != nullptr) gzmime->open(path + "/mime.gz");
if (gzhtml != nullptr) gzhtml->open(path + "/html.gz");
}
LangWriter::LangWriter(const std::string& path, const std::unordered_set<std::string>& output_files) {
util::createDirectories(path);

if (output_files.count("url"))
url_file.open(path + "/url.gz");
if (output_files.count("text"))
text_file.open(path + "/text.gz");
if (output_files.count("mime"))
mime_file.open(path + "/mime.gz");
if (output_files.count("html"))
html_file.open(path + "/html.gz");
if (output_files.count("file"))
file_file.open(path + "/file.gz");
if (output_files.count("date"))
date_file.open(path + "/date.gz");
}

gzurl->writeLine(url);
gztext->writeLine(b64text);
if (gzmime != nullptr) gzmime->writeLine(mime);
if (gzhtml != nullptr) gzhtml->writeLine(b64html);
void LangWriter::write(Record const &record, std::string const &chunk) {
if (url_file.is_open())
url_file.writeLine(record.getURL());
if (mime_file.is_open())
mime_file.writeLine(record.getHTTPcontentType());
if (file_file.is_open())
file_file.writeLine(record.getFilename() + ":" + std::to_string(record.getOffset()) + ":" + std::to_string(record.getSize()));
if (date_file.is_open())
date_file.writeLine(record.getWARCdate());
if (html_file.is_open())
html_file.writeLine(util::encodeBase64(record.getPayload()));
if (text_file.is_open())
text_file.writeLine(util::encodeBase64(chunk));
}

std::string get_paragraph_id(const std::string& text) {
@@ -111,23 +128,33 @@ namespace warc2text{
}

void BilangWriter::write(const Record& record, bool paragraph_identification) {
std::string base64text;
std::string base64html;

if (output_files.count("html") == 1)
util::encodeBase64(record.getPayload(), base64html);

for (const auto& it : record.getTextByLangs()) {
std::string payload = it.second;
std::string chunk = it.second;

if (paragraph_identification) {
payload = get_paragraph_id(payload);
}
if (paragraph_identification)
chunk = get_paragraph_id(chunk);

util::encodeBase64(payload, base64text);
this->write(it.first, base64text, record.getURL(), record.getHTTPcontentType(), base64html);
auto writer_it = writers.try_emplace(it.first, folder + "/" + it.first, output_files);
writer_it.first->second.write(record, chunk);
}
}

void JSONLinesWriter::write(const Record& record, [[maybe_unused]] bool paragraph_identification) {
// JSON lines format (https://jsonlines.org)
for (auto &&chunk : record.getTextByLangs()) {
out_ << boost::json::value{
{"f", boost::json::string(record.getFilename())},
{"o", boost::json::value(record.getOffset())},
{"s", boost::json::value(record.getSize())},
{"rs", boost::json::value(record.getPayload().size())},
{"ps", boost::json::value(chunk.second.size())},
{"l", boost::json::string(chunk.first)},
{"u", boost::json::string(record.getURL())},
{"c", boost::json::string(record.getHTTPcontentType())},
{"ts", boost::json::string(record.getWARCdate())},
{"p", boost::json::string(chunk.second)},
} << "\n";
}
}
}
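
The new `BilangWriter::write` creates one `LangWriter` per language on first use via `try_emplace`. The sketch below isolates that pattern with stand-in types (`FakeLangWriter` and `FakeBilangWriter` are hypothetical and do no file I/O), to show that the mapped value is constructed only when the key is new.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for LangWriter: counts constructions and writes.
struct FakeLangWriter {
    static inline int constructed = 0;  // C++17 inline variable
    std::string path;
    int lines = 0;

    explicit FakeLangWriter(std::string p) : path(std::move(p)) {
        ++constructed;  // the real class would create directories and open files here
    }

    void write(const std::string& /*chunk*/) { ++lines; }
};

// Hypothetical stand-in for BilangWriter, reduced to the lookup logic.
struct FakeBilangWriter {
    std::string folder;
    std::unordered_map<std::string, FakeLangWriter> writers;

    void write(const std::string& lang, const std::string& chunk) {
        // try_emplace constructs the writer in place only if `lang` is absent.
        // Note that the path argument is still evaluated on every call.
        auto result = writers.try_emplace(lang, folder + "/" + lang);
        result.first->second.write(chunk);
    }
};
```

In-place construction also means the writer is never copied or moved into the map, which matters for a type that owns a file handle and a buffer.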

75 changes: 47 additions & 28 deletions src/bilangwriter.hh
@@ -3,65 +3,84 @@

#include <unordered_map>
#include <unordered_set>
#include <ostream>
#include "record.hh"
#include "zlib.h"

namespace warc2text {

/**
* Generic interface for writing records to some form of output.
*/
class RecordWriter {
public:
virtual void write(const Record& record, bool paragraph_identification = false) = 0;
virtual ~RecordWriter() = default;
};

/**
* Writer used by BilangWriter to write a single compressed file
* (i.e. a column for a specific language)
*/
class GzipWriter {
private:
FILE* dest;
z_stream s{};
unsigned char* buf;
std::size_t compressed;
void compress(const char* in, std::size_t size, int flush);

public:
GzipWriter();
~GzipWriter();
void open(const std::string& filename);
void close();
void write(const char* text, std::size_t size);
void writeLine(const char* text, std::size_t size);
void writeLine(const std::string& text);
bool is_open();
static const std::size_t BUFFER_SIZE = 4096;
};

class BilangWriter {
/**
* Writes records to a specific folder for a specific language.
*/
class LangWriter {
private:
GzipWriter url_file;
GzipWriter mime_file;
GzipWriter text_file;
GzipWriter html_file;
GzipWriter file_file;
GzipWriter date_file;
public:
LangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files);
void write(const Record& record, const std::string &chunk);
};

class BilangWriter : public RecordWriter {
private:
std::string folder;
std::unordered_map<std::string, GzipWriter> url_files;
std::unordered_map<std::string, GzipWriter> mime_files;
std::unordered_map<std::string, GzipWriter> text_files;
std::unordered_map<std::string, GzipWriter> html_files;
std::unordered_set<std::string> output_files;

void write(const std::string& lang, const std::string& b64text, const std::string& url, const std::string& mime, const std::string& b64html);

std::unordered_map<std::string, LangWriter> writers;
public:
explicit BilangWriter(const std::string& folder) :
folder(folder),
url_files(),
mime_files(),
text_files(),
html_files(),
output_files({}) // url and text are mandatory regardless
{};

explicit BilangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files) :
folder(folder),
url_files(),
mime_files(),
text_files(),
html_files(),
output_files(output_files)
{};

void write(const Record& record, bool paragraph_identification = false);
BilangWriter(const std::string& folder, const std::unordered_set<std::string>& output_files = {})
: folder(folder)
, output_files(output_files)
{
//
};

virtual void write(const Record& record, bool paragraph_identification = false);
};

class JSONLinesWriter : public RecordWriter {
private:
std::ostream &out_;
public:
explicit JSONLinesWriter(std::ostream &out) : out_(out) {};

virtual void write(const Record& record, bool paragraph_identification = false);
};
}

#endif
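
With both writers deriving from `RecordWriter`, calling code can pick one at startup and treat them uniformly afterwards. The wiring for `--jsonl` lives outside this excerpt and may differ; the sketch below only illustrates the pattern. `Record` is a simplified stand-in, and `LineWriter`, `CountingWriter` and `make_writer` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>

// Simplified stand-in for warc2text's Record.
struct Record {
    std::string url;
    std::string text;
};

class RecordWriter {
public:
    virtual void write(const Record& record) = 0;
    virtual ~RecordWriter() = default;
};

// Stands in for JSONLinesWriter: writes one line per record to a stream.
class LineWriter : public RecordWriter {
    std::ostream& out_;
public:
    explicit LineWriter(std::ostream& out) : out_(out) {}
    void write(const Record& record) override {
        out_ << record.url << '\t' << record.text << '\n';
    }
};

// Stands in for BilangWriter: only counts records instead of writing files.
class CountingWriter : public RecordWriter {
public:
    int count = 0;
    void write(const Record&) override { ++count; }
};

// Choose the writer once; the rest of the pipeline only sees RecordWriter.
std::unique_ptr<RecordWriter> make_writer(bool jsonl, std::ostream& out) {
    if (jsonl) return std::make_unique<LineWriter>(out);
    return std::make_unique<CountingWriter>();
}
```

The sketch marks the overriding functions with `override` rather than repeating `virtual`, so the compiler reports a signature mismatch instead of silently declaring a new function.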