- The RCV1-v2 dataset is described at RCV1-v2 info.
- Because obtaining the data requires signing an agreement, this repo contains no data. In my case, it took only 1 day to be granted access.
- Unfortunately, the downloaded file is not in a handy format, and there is no obvious way to extract the desired information from it, when all I want is a `csv` file containing something like `ID, text, labels`. This script serves that purpose.
Basically, just run `python main.py path_to_dir`, where `path_to_dir` is the absolute path to the directory containing the file `rcv1.tar.xz`. It outputs 2 `csv` files in `path_to_dir`:
- `rcv1_v2.csv`: the main data you are interested in
- `rcv1_v2_topics_desc.csv`: descriptions of the topics
The content of the `text` column is raw text in XML format. It can be parsed easily with `xml.etree.ElementTree.XML(text)`.
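As a rough illustration of the parsing step, here is a minimal sketch that turns the XML in the `text` column into plain text. It assumes the CSV has a header row with a column named `text` and is UTF-8 encoded. The helper names `xml_to_plain_text` and `iter_plain_text` are hypothetical and not part of this repo. The sketch does not rely on any particular tag names, since the exact XML layout is not described here.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_plain_text(xml_text: str) -> str:
    """Return all text content of an XML string with whitespace normalized."""
    root = ET.XML(xml_text)
    # itertext() yields the text of every element in document order,
    # so this works regardless of which tags the document uses.
    return " ".join(" ".join(root.itertext()).split())


def iter_plain_text(csv_path: str, text_column: str = "text"):
    """Yield plain text for each CSV row (the column name is an assumption)."""
    # Raise the per-field limit in case a field exceeds the csv module default.
    csv.field_size_limit(2**31 - 1)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            yield xml_to_plain_text(row[text_column])
```

For example, `next(iter_plain_text("path_to_dir/rcv1_v2.csv"))` would give the plain text of the first row, if the assumptions above hold for your output file.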