TLS-Covid19 is a multilingual, multi-document annotated dataset for Timeline Summarization (TLS), built to foster the development and evaluation of new algorithms and, at the same time, to enable the study of news coverage of the COVID-19 pandemic. It consists of a number of curated topics related to the COVID-19 outbreak, with associated news articles from Portuguese and English news outlets and their respective reference timelines as gold standard. The following figure shows the format and the structure of the dataset.
The rise of social media and the explosion of digital news on the web have created new challenges for extracting knowledge and making sense of published information. Automated timeline generation appears in this context as a promising way to help users deal with this information overload. Formally, Timeline Summarization (TLS) can be defined as a subtask of Multi-Document Summarization (MDS) conceived to highlight the most important information during the development of a story over time by summarizing long-lasting events in chronological order. Unlike traditional MDS, however, TLS has only a limited number of publicly available datasets. This lack of datasets is even more noticeable for low-resource languages, including Portuguese, which, despite being the sixth most spoken language in the world [Ethnologue (2019, 22nd edition)], lacks a specific TLS dataset.
Following the worldwide coverage of the coronavirus pandemic, we propose the TLS-Covid19 dataset, a novel corpus for the Portuguese and English languages.
To create this dataset, we take advantage of liveblogs, webpages where news media outlets offer daily live coverage of an ongoing event. Each liveblog (usually with a different URL) consists of a set of news stories and a set of key moments. The key moments are manually selected by journalists from the whole set of news stories, thus giving rise to the ground-truth timeline.
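The liveblog structure described above can be pictured with a minimal Python sketch. The class and field names (`NewsItem`, `is_key_moment`, `ground_truth_timeline`) and the sample entries are illustrative assumptions, not the dataset's actual schema: each liveblog entry carries a flag saying whether journalists marked it as a key moment, and the reference timeline is the flagged entries grouped by date.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class NewsItem:
    """One liveblog entry (hypothetical fields, for illustration only)."""
    published: date
    text: str
    is_key_moment: bool = False  # True if journalists selected it as a key moment


def ground_truth_timeline(items):
    """Group the key-moment entries by date, in chronological order."""
    by_date = defaultdict(list)
    for item in items:
        if item.is_key_moment:
            by_date[item.published].append(item.text)
    return dict(sorted(by_date.items()))


# Made-up entries; real liveblogs contain many stories per day.
liveblog = [
    NewsItem(date(2020, 3, 17), "Second key entry.", True),
    NewsItem(date(2020, 3, 16), "Opening key entry.", True),
    NewsItem(date(2020, 3, 16), "Routine update."),
]
timeline = ground_truth_timeline(liveblog)
```

All entries, flagged or not, form the input documents; only the flagged ones appear in `timeline`.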
We consider two Portuguese news sources, Público and Observador, and two English news sources, CNN and The Guardian.
As a rule of thumb, we take the beginning of the liveblog coverage as the start of the time period. For instance, the Público liveblog has been tracked since March 16, 2020; Observador since January 30, 2020; CNN since January 22, 2020; and The Guardian since January 24, 2020. Our aim is to continue expanding the dataset with further articles and possibly new topics until the end of the outbreak and/or the end of the liveblogs' coverage. We anticipate that as the pandemic evolves, the amount of data collected will grow significantly.
The source code to reproduce the dataset is available in a Google Colab notebook. Try it here:
As of December 31, 2020
The following tables give detailed statistics about the dataset. As of December 31, 2020, we have collected 143 common topics for Público and Observador, and 35 common topics for CNN and The Guardian.
By news source:
Source | #Topics | Lang | Input docs: #Docs | Input docs: avg #sents | Input docs: avg #dates | Input docs: avg sents/date | Ground truth: avg #sents | Ground truth: avg #dates | Ground truth: avg sents/date | Compression: sents (%) | Compression: dates (%) |
---|---|---|---|---|---|---|---|---|---|---|---|
Público | 143 | PT | 28,327 | 1092.15 | 99.93 | 10.93 | 62.82 | 40.05 | 1.57 | 5.75 | 40.08 |
Observador | 143 | PT | 40,181 | 1653.22 | 120.52 | 13.72 | 114.90 | 57.77 | 1.99 | 6.95 | 47.93 |
CNN | 35 | EN | 26,043 | 6178.54 | 189.71 | 32.57 | 30.11 | 20.97 | 1.44 | 0.49 | 11.05 |
The Guardian | 35 | EN | 5,848 | 1118.86 | 80.69 | 13.87 | 25.26 | 21.97 | 1.15 | 2.26 | 27.23 |
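The Compression columns in the per-source table are consistent with a simple ratio: the ground-truth average divided by the input average, expressed as a percentage. The short Python sketch below recomputes them from the averages in the table; the function name `compression` is ours, and the assumption that the figures are a ratio of averages (rather than an average of per-topic ratios) is inferred from the numbers, not stated by the authors.

```python
# Averages copied from the per-source table, as (input, ground truth) pairs.
stats = {
    "Público":      {"sents": (1092.15, 62.82),  "dates": (99.93, 40.05)},
    "Observador":   {"sents": (1653.22, 114.90), "dates": (120.52, 57.77)},
    "CNN":          {"sents": (6178.54, 30.11),  "dates": (189.71, 20.97)},
    "The Guardian": {"sents": (1118.86, 25.26),  "dates": (80.69, 21.97)},
}


def compression(input_avg, ground_truth_avg):
    """Ground-truth size as a percentage of the input size."""
    return round(100 * ground_truth_avg / input_avg, 2)


ratios = {
    source: {unit: compression(*pair) for unit, pair in row.items()}
    for source, row in stats.items()
}

for source, r in ratios.items():
    print(f"{source}: sents {r['sents']}%, dates {r['dates']}%")
```

Read this way, a reference timeline for CNN keeps roughly 0.5% of the input sentences, while one for Observador keeps about 7%.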
By news source language:
Lang | #Topics | Input docs: #Docs | Input docs: avg #sents | Input docs: avg #dates | Input docs: avg sents/date | Ground truth: avg #sents | Ground truth: avg #dates | Ground truth: avg sents/date | Compression: sents (%) | Compression: dates (%) |
---|---|---|---|---|---|---|---|---|---|---|
PT | 143 | 68,508 | 1372.69 | 110.23 | 12.45 | 88.86 | 48.91 | 1.82 | 6.47 | 44.37 |
EN | 35 | 31,891 | 3648.70 | 135.20 | 26.99 | 27.69 | 21.47 | 1.29 | 0.76 | 15.89 |
Distribution of topics by type:
Type | PT | EN |
---|---|---|
PER | 17 | 3 |
ORG | 33 | 6 |
LOC | 82 | 25 |
KW | 11 | 1 |
WordClouds EN/PT:
TLS-Covid19 allows one to follow the evolution of a topic over time and to compare what different news outlets say about it.
One can also look at keywords, part-of-speech tags, entities, or events to see how coverage has changed over time.
As with most datasets of this kind, one can also look at collocates. A few examples: keywords that were common in the same time period; words that appear near "covid-19" in different time periods; entities, events, nouns, or verbs that were more common at the beginning of the pandemic than in December 2020.
Finally, one can also create a subset of the dataset based on the publication date, the source, the country, etc.
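As a rough illustration of this kind of analysis, the Python sketch below counts word frequencies per month so that terms can be compared across time periods. The input format (pairs of publication date and text), the function name `terms_by_month`, and the sample sentences are assumptions made for the example; they are not taken from the dataset.

```python
import re
from collections import Counter, defaultdict
from datetime import date


def terms_by_month(docs):
    """Map 'YYYY-MM' to a Counter of lowercase word tokens."""
    counts = defaultdict(Counter)
    for published, text in docs:
        # Keep hyphenated tokens such as "covid-19" in one piece.
        tokens = re.findall(r"[\w-]+", text.lower())
        counts[published.strftime("%Y-%m")].update(tokens)
    return counts


# Made-up sentences standing in for liveblog entries.
docs = [
    (date(2020, 3, 16), "Lockdown announced as covid-19 cases rise"),
    (date(2020, 3, 20), "Lockdown extended"),
    (date(2020, 12, 27), "Vaccine rollout begins"),
]
monthly = terms_by_month(docs)

# Compare how often a term appears early versus late in the period.
early, late = monthly["2020-03"], monthly["2020-12"]
```

A real analysis would likely add stop-word removal and language-specific tokenization for the Portuguese sources.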
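Such a subset could be built with a small filter like the one sketched below. The record layout (dictionaries with `source` and `date` keys) and the function name `subset` are hypothetical and would need to be adapted to the actual file format of the dataset.

```python
from datetime import date


def subset(records, source=None, start=None, end=None):
    """Keep records from `source` published within [start, end], inclusive.

    Any criterion left as None is ignored.
    """
    selected = []
    for rec in records:
        if source is not None and rec["source"] != source:
            continue
        if start is not None and rec["date"] < start:
            continue
        if end is not None and rec["date"] > end:
            continue
        selected.append(rec)
    return selected


# Made-up records for illustration.
records = [
    {"source": "CNN", "date": date(2020, 1, 22), "text": "..."},
    {"source": "CNN", "date": date(2020, 6, 1), "text": "..."},
    {"source": "Observador", "date": date(2020, 1, 30), "text": "..."},
]
cnn_early = subset(records, source="CNN", end=date(2020, 3, 31))
```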
Pasquali, A., Campos, R., Ribeiro, A., Santana, B., Jorge, A., and Jatowt, A. (2021). TLS-Covid19: A New Annotated Corpus for Timeline Summarization. In: Hiemstra D., Moens M-F., Mothe J., Perego R., Potthast M., Sebastiani F. (eds), Advances in Information Retrieval. ECIR'21 (Lucca, Italy. March 28 - April 1). Lecture Notes in Computer Science, vol 12656, pp. 497 - 512. ECIR21 presentation
For further information about the TLS-Covid19 dataset, please contact Alexandre Ribeiro, Arian Pasquali, or Ricardo Campos.