TLS-Covid19 dataset

What

TLS-Covid19 is a multilingual, multi-document annotated dataset for Timeline Summarization (TLS), built to foster the development and evaluation of new algorithms and, at the same time, to enable the study of news coverage of the COVID-19 pandemic. It consists of curated topics related to the COVID-19 outbreak, each with associated news articles from Portuguese and English news outlets and a reference timeline that serves as the gold standard. The following figure shows the format and the structure of the dataset.

Why

The rise of social media and the explosion of digital news on the web have created new challenges for extracting knowledge and making sense of published information. Automated timeline generation appears in this context as a promising way to help users deal with this information overload. Formally, Timeline Summarization (TLS) can be defined as a subtask of Multi-Document Summarization (MDS) designed to highlight the most important information in the development of a story by summarizing long-lasting events in chronological order. Unlike traditional MDS, however, TLS has few publicly available datasets. This shortage is even more noticeable for low-resource languages, including Portuguese, which, despite being the sixth most spoken language in the world [Ethnologue (2019, 22nd edition)], lacks a dedicated TLS dataset.

Following the worldwide coverage of the coronavirus pandemic, we propose the TLS-Covid19 dataset, a novel corpus for the Portuguese and English languages.

How

To create this dataset, we take advantage of liveblogs, webpages where news outlets offer daily live coverage of an ongoing event. Each liveblog (usually with its own URL) consists of a set of news stories and a set of key moments. The key moments are manually selected by journalists from the whole set of news stories, and thus form the ground-truth timeline.
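The relationship between news stories, key moments, and the ground-truth timeline can be sketched as follows. This is only an illustration: the `NewsItem` class, its fields, and the `reference_timeline` helper are assumptions made for this example and do not reflect the dataset's actual file format.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class NewsItem:
    """One liveblog entry (hypothetical schema, not the dataset's real format)."""
    published: date
    text: str
    is_key_moment: bool = False  # True when journalists flagged the entry


def reference_timeline(items):
    """Group the journalist-selected key moments by date, oldest date first."""
    by_date = defaultdict(list)
    for item in items:
        if item.is_key_moment:
            by_date[item.published].append(item.text)
    return dict(sorted(by_date.items()))


# Placeholder entries; the texts are not taken from the dataset.
liveblog = [
    NewsItem(date(2020, 3, 17), "Placeholder story B"),
    NewsItem(date(2020, 3, 17), "Placeholder key moment C", is_key_moment=True),
    NewsItem(date(2020, 3, 16), "Placeholder key moment A", is_key_moment=True),
]
timeline = reference_timeline(liveblog)
```

Here `timeline` maps each date to the key moments published on it, while the full `liveblog` list plays the role of the input documents.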

Data Sources

We consider two Portuguese news sources, Público and Observador, and two English news sources, CNN and The Guardian.

As a rule of thumb, we take the beginning of each liveblog's coverage as the start of the time period. For instance, the Público liveblog has been tracked since March 16, 2020; Observador since January 30, 2020; CNN since January 22, 2020; and The Guardian since January 24, 2020. Our aim is to continue expanding the dataset with further articles and possibly new topics until the end of the outbreak and/or the end of the liveblogs’ coverage. We anticipate that as the pandemic evolves, the amount of data collected will grow significantly.

The source code to reproduce the dataset is available in a Google Colab notebook (tls_covid19.ipynb in this repository).

Statistics

As of December 31, 2020

The following tables give detailed statistics for the dataset. As of December 31, 2020, we have collected 143 common topics for Público and Observador, and 35 common topics for CNN and The Guardian.

By news source:

Source	#Topics	Lang	Input #docs	Input avg #sents	Input avg #dates	Input avg sents/date	Ground-truth avg #sents	Ground-truth avg #dates	Ground-truth avg sents/date	Compression sents (%)	Compression dates (%)
Público	143	PT	28,327	1092.15	99.93	10.93	62.82	40.05	1.57	5.75	40.08
Observador	143	PT	40,181	1653.22	120.52	13.72	114.90	57.77	1.99	6.95	47.93
CNN	35	EN	26,043	6178.54	189.71	32.57	30.11	20.97	1.44	0.49	11.05
Guardian	35	EN	5,848	1118.86	80.69	13.87	25.26	21.97	1.15	2.26	27.23
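The compression columns are consistent with the ratio of the ground-truth average to the input average, expressed as a percentage. This reading is inferred from the numbers in the table rather than stated in the text, so the short check below should be taken as a sanity test under that assumption; the `compression` helper is introduced only for this example.

```python
# Averages copied from the per-source table as (ground truth, input) pairs.
SENTS = {
    "Público": (62.82, 1092.15),
    "Observador": (114.90, 1653.22),
    "CNN": (30.11, 6178.54),
    "Guardian": (25.26, 1118.86),
}
DATES = {
    "Público": (40.05, 99.93),
    "Observador": (57.77, 120.52),
    "CNN": (20.97, 189.71),
    "Guardian": (21.97, 80.69),
}


def compression(ground_truth_avg, input_avg):
    """Ground-truth size as a percentage of input size (assumed definition)."""
    return round(100 * ground_truth_avg / input_avg, 2)


sent_compression = {src: compression(*pair) for src, pair in SENTS.items()}
date_compression = {src: compression(*pair) for src, pair in DATES.items()}
```

Under this definition, a lower value means a more aggressive summary: CNN's ground-truth timelines keep under 1% of the input sentences, whereas Observador's keep almost 7%.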

By news source language:

Lang	#Topics	Input #docs	Input avg #sents	Input avg #dates	Input avg sents/date	Ground-truth avg #sents	Ground-truth avg #dates	Ground-truth avg sents/date	Compression sents (%)	Compression dates (%)
PT	143	68,508	1372.69	110.23	12.45	88.86	48.91	1.82	6.47	44.37
EN	35	31,891	3648.70	135.20	26.99	27.69	21.47	1.29	0.76	15.89

Distribution of topics by type:

Type	PT	EN
PER	17	3
ORG	33	6
LOC	82	25
KW	11	1

WordClouds EN/PT:

Use Cases

TLS-Covid19 allows one to follow the evolution of a topic over time and to compare how different news outlets cover the same topic.

One can also track keywords, part-of-speech tags, entities, or events to see how coverage has changed over time.

As is common with datasets of this kind, one can also look at collocates. Examples include keywords that were common in the same time period; words that appear near "covid-19" in different time periods; and entities, events, nouns, or verbs that were more common at the beginning of the pandemic than in December 2020.
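A collocate count of this kind can be sketched with the standard library alone. The `collocates` function, its `window` parameter, the simple regex tokenizer, and the sample sentences are all assumptions made for illustration; a real analysis would likely use a proper tokenizer and stop-word filtering.

```python
import re
from collections import Counter


def collocates(texts, target="covid-19", window=3):
    """Count tokens that occur within `window` positions of `target`."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep hyphenated tokens such as "covid-19" intact.
        tokens = re.findall(r"[\w-]+", text.lower())
        for i, token in enumerate(tokens):
            if token == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                counts.update(left + right)
    return counts


# Placeholder sentences standing in for two time periods of the dataset.
early = ["New covid-19 cases rise in Lisbon", "Covid-19 cases confirmed"]
late = ["Covid-19 vaccine rollout begins"]

early_counts = collocates(early, window=2)
late_counts = collocates(late, window=2)
```

Comparing `early_counts.most_common()` with `late_counts.most_common()` then shows which neighbouring words gained or lost prominence between the two periods.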

Finally, one can create a subset of the dataset based on, for example, the publication date, the source, or the country.
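Such a subset could be built along the following lines. The record layout (dictionaries with `source`, `date`, and `text` keys) and the `subset` helper are hypothetical and would need to be adapted to the structure actually produced by the notebook.

```python
from datetime import date


def subset(records, source=None, start=None, end=None):
    """Keep records from `source` dated within [start, end]; None disables a filter."""
    return [
        r for r in records
        if (source is None or r["source"] == source)
        and (start is None or r["date"] >= start)
        and (end is None or r["date"] <= end)
    ]


# Placeholder records; texts are not taken from the dataset.
records = [
    {"source": "CNN", "date": date(2020, 1, 22), "text": "placeholder 1"},
    {"source": "Guardian", "date": date(2020, 1, 24), "text": "placeholder 2"},
    {"source": "CNN", "date": date(2020, 12, 1), "text": "placeholder 3"},
]

# CNN entries published up to the end of March 2020.
cnn_q1 = subset(records, source="CNN", end=date(2020, 3, 31))
```

Keeping every filter optional lets the same helper select by source only, by date range only, or by both.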

Publication

Pasquali, A., Campos, R., Ribeiro, A., Santana, B., Jorge, A., and Jatowt, A. (2021). TLS-Covid19: A New Annotated Corpus for Timeline Summarization. In: Hiemstra, D., Moens, M.-F., Mothe, J., Perego, R., Potthast, M., Sebastiani, F. (eds), Advances in Information Retrieval. ECIR'21 (Lucca, Italy, March 28 - April 1). Lecture Notes in Computer Science, vol. 12656, pp. 497-512.

Contact

For further information about the TLS-Covid19 dataset, please contact Alexandre Ribeiro, Arian Pasquali, or Ricardo Campos.
