GitHub - ductri/reuters_loader: Load and convert dataset RCV1-v2 to csv file

The RCV1-v2 dataset is described at RCV1-v2 info.
Because obtaining the data requires signing an agreement, this repo contains no data. Personally, it took me only 1 day to be granted access to the data.
Unfortunately, the downloaded file is not in a handy format, and there is no obvious way to extract the desired info from it, while all I want is a csv file containing something like ID, text, labels. This script serves that purpose.

Basically, just run:

python main.py path_to_dir

where path_to_dir is the absolute path to the directory containing the file rcv1.tar.xz. It outputs 2 csv files in path_to_dir.

The content of the text column is raw text in xml format. It can be parsed easily with xml.etree.ElementTree.XML(text).
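As a rough sketch of that parsing step, the snippet below turns one XML string into plain text. The helper name xml_to_plain_text and the sample string are made up for illustration. The real tag layout of the text column may differ, so treat this as a starting point, not as what main.py does.

```python
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_string):
    """Parse an XML string and return its text content as one line.

    Hypothetical helper, not part of this repo.
    """
    root = ET.XML(xml_string)
    # itertext() yields every text node under the root, in document order.
    pieces = (piece.strip() for piece in root.itertext())
    # Drop whitespace-only nodes (e.g. newlines between tags) and join the rest.
    return " ".join(piece for piece in pieces if piece)


# Made-up sample; the actual content of the text column may be structured differently.
sample = "<text><p>First paragraph.</p>\n<p>Second paragraph.</p></text>"
print(xml_to_plain_text(sample))
```

The same function can be applied to each value of the text column after loading a csv file with the csv module or pandas.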
