🌍 arinaraileanu.github.io/What-Do-Parties-Talk-About
The Quotebank dataset provides 178 million quotes gathered from English news articles published over the years. We aim to assess whether there is a link between the main topics addressed by political figures and the parties they belong to. If such a link exists, it could serve as an additional data point for voters when evaluating a party, which is why we consider this research relevant. For our analysis, we apply a sequence of filters on Quotebank to keep only quotes by politicians and enrich them with additional information about the speakers from Wikidata. Besides dataset statistics and an analysis of the speakers' words, we employ two BERTopic models: one for topic analysis per year and one for topic analysis over time.
Through our current and planned analysis, we aim to answer the following questions and gain insight into the relationship between the most common topics addressed in quotes and the parties their speakers belong to:
- Can politicians' quotes be clustered by subject?
- What are the most common subjects tackled by politicians?
- Do the main subjects change with time?
Given that we are interested in the political orientation of our speakers, we decided to enrich our data with additional metadata about the speakers, provided through the `speaker_attributes.parquet` file.
- We first have to map the Q-code attributes to their corresponding labels. Initially, we attempted to use the Wikidata API to aggregate all the aliases and the label for each Q-code. However, this procedure was slow, so we later decided to use the provided `wikidata_labels_descriptions_quotebank.csv.bz2` file for the mapping.
- We filter the Wikidata entries to keep only the Q-codes of speakers who have politician as one of their occupations and who belong to at least one party.
- We inner join the Quotebank and Wikidata entries on the speakers' Q-codes, so that every row now contains additional labelled information about the speaker. This results in a dataframe containing only politician quotes.
- If a `qid` field in Quotebank does not match any of the `id` values in Wikidata, we observed that this happens because the Quotebank Q-code is not the most recent one. We simply drop these rows, as they have no correspondence in Wikidata.
- To attribute a party to a quote, we engineered a dataset that maps each unique speaker Q-code to a list of `<party_name>, <start_date>, <end_date>` tuples. We retrieve the dates using the `qwikidata` API for Wikidata.
- If a speaker belonged to only one political party throughout their life, we associate that party with their quotes, regardless of the date.
- If a speaker belonged to multiple parties, we select the party they were part of at the time of the quotation (see the sketch after this list).
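A minimal sketch of this party-attribution step, assuming hypothetical column names (`speaker_qid`, `quote_date`) and a `parties` dictionary built from the `<party_name>, <start_date>, <end_date>` tuples:

```python
import pandas as pd

def attribute_party(speaker_qid, quote_date, parties):
    """Pick the party a speaker belonged to at the time of the quote."""
    memberships = parties.get(speaker_qid, [])
    if len(memberships) == 1:
        # A single lifelong membership is used regardless of the quote date.
        return memberships[0][0]
    for party_name, start, end in memberships:
        # Open-ended memberships have None as start and/or end date.
        if (start is None or start <= quote_date) and (end is None or quote_date <= end):
            return party_name
    return None  # no membership covers the quote date

# Illustrative dummy data (Q-codes, party names and dates are placeholders).
politicians = pd.DataFrame({
    "speaker_qid": ["Q000001", "Q000002"],
    "quote_date": pd.to_datetime(["2018-05-01", "2018-07-12"]),
})
parties = {
    "Q000001": [("Party A", None, None)],
    "Q000002": [("Party A", pd.Timestamp("2010-01-01"), pd.Timestamp("2015-12-31")),
                ("Party B", pd.Timestamp("2016-01-01"), None)],
}
politicians["party"] = politicians.apply(
    lambda row: attribute_party(row["speaker_qid"], row["quote_date"], parties), axis=1
)
```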
For the current stage of the project, we decided to analyze data from 2018. In our final analysis, we will include data for all the available years. Our research goals are all politically related, so we use the subset of quotes belonging to politicians, extracted with the methods described above in the additional datasets section.
To ensure higher confidence in our analysis, we filtered out rows where the probability of the speaker is lower than 0.6. Furthermore, we removed all rows that have multiple Q-codes associated with their speaker, as it is difficult to distinguish who actually uttered the quote.
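A possible implementation of these two filters, assuming the Quotebank columns are loaded as `probas` (speaker-probability pairs) and `qids` (the speaker's Q-codes); the exact layout may differ from this sketch:

```python
import pandas as pd

# Dummy frame mimicking the assumed Quotebank layout: `probas` holds
# [speaker, probability] pairs and `qids` the speaker's Q-codes.
quotes = pd.DataFrame({
    "quotation": ["example quote"],
    "probas": [[["Speaker A", "0.85"]]],
    "qids": [["Q12345"]],
})

def keep_quote(row):
    top_proba = float(row["probas"][0][1])  # probability of the most likely speaker
    unambiguous = len(row["qids"]) == 1     # exactly one Q-code for the speaker
    return top_proba >= 0.6 and unambiguous

filtered = quotes[quotes.apply(keep_quote, axis=1)]
```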
To investigate whether quotes can be clustered by subject, we create sentence embeddings using BERTopic. We used two models:
- Fit and transform quotes from all parties per year to visualise the topic analysis per year
- Fit and transform quotes per party through the years to visualise the topic analysis over time (a sketch of both set-ups follows this list)
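A minimal sketch of the two BERTopic set-ups, assuming `quotes_2018` is the filtered politician-quotes dataframe with `quotation` and `date` columns (both column names are assumptions):

```python
import pandas as pd
from bertopic import BERTopic

# `quotes_2018` is assumed to be the filtered dataframe for one year / one party.
docs = quotes_2018["quotation"].tolist()
timestamps = pd.to_datetime(quotes_2018["date"]).tolist()

# Model 1: topic analysis for a single year (quotes from all parties together).
topic_model = BERTopic(language="english")
topics, probs = topic_model.fit_transform(docs)

# Model 2: topic analysis over time (per party); the exact `topics_over_time`
# signature may differ between BERTopic versions.
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=12)
fig = topic_model.visualize_topics_over_time(topics_over_time)
```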
We further explored the most common words, as named entities and as parts of speech, for the two most quoted parties - the Republican and the Democratic Party. Moreover, we predicted the US election results in each state based on the state mentions by members of each party.
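The state-based prediction relies on a simple mention-count heuristic; a minimal sketch, with a hypothetical `party`/`quotation` dataframe and a truncated state list, could look like this:

```python
import pandas as pd

US_STATES = ["Texas", "California", "Florida"]  # truncated list for illustration

# Hypothetical party-labelled quotes.
quotes = pd.DataFrame({
    "party": ["Republican Party", "Democratic Party", "Republican Party"],
    "quotation": ["We love Texas", "California leads on climate", "Texas is growing"],
})

def predict_state_winners(quotes, states):
    """Predict each state's winner as the party whose members mention it most."""
    predictions = {}
    for state in states:
        mentions = quotes[quotes["quotation"].str.contains(state, case=False)]
        counts = mentions["party"].value_counts()
        predictions[state] = counts.idxmax() if not counts.empty else None
    return predictions

print(predict_state_winners(quotes, US_STATES))
```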
As a crude example, we can investigate the occurrences of topic words in the quotations of a politician. We pre-process the data using NLTK to identify commonly used nouns within speaker quotations. As an example, we empirically observe differences between the nouns used by Trump and Obama, which are discussed in the notebook. The table below shows the 10 most common nouns for each speaker.
Donald Trump | Barack Obama |
---|---|
people | people |
country | time |
president | president |
time | world |
Trump | country |
states | politics |
lot | years |
deal | democracy |
world | things |
things | work |
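A minimal sketch of how such noun counts can be produced with NLTK (the tokenizer and POS-tagger choices, as well as the input list of quotations, are assumptions):

```python
from collections import Counter

import nltk

# One-time downloads of the tokenizer and POS-tagger models.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

def most_common_nouns(quotations, n=10):
    """Return the `n` most frequent nouns across a speaker's quotations."""
    nouns = []
    for quote in quotations:
        tokens = nltk.word_tokenize(quote)
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
                nouns.append(word.lower())
    return Counter(nouns).most_common(n)

# Example with a dummy quotation.
print(most_common_nouns(["The people of this country deserve a great deal."], n=5))
```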
Using the initial results of the first question, we conduct additional analysis on the quotes contained within topic clusters. We keep track of the political party of the speakers within each topic cluster and output the number of quotes each party contributes to a given cluster.
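Continuing the BERTopic sketch above, the per-cluster party counts could be computed as follows (the `party` and `topic` column names are assumptions):

```python
# `quotes_2018` and `topics` come from the earlier BERTopic sketch;
# each quote receives its cluster assignment as a `topic` column.
quotes_2018["topic"] = topics

# Number of quotes each party contributes to every topic cluster.
party_counts = (
    quotes_2018.groupby(["topic", "party"])
               .size()
               .reset_index(name="n_quotes")
               .sort_values(["topic", "n_quotes"], ascending=[True, False])
)
print(party_counts.head())
```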
Not only do politicians switch parties from time to time, but opinions within the same party can also fluctuate. Our first approach to monitoring quote trends involves counting the quotes containing a certain keyword by date and relating them to Google Trends, to figure out whether they follow a similar pattern. The plot below suggests a link between the occurrence of "shooting" in Donald Trump's quotes and the search trend of the word. Furthermore, both can be related to mass shootings that happened in the US, such as the El Paso shooting, which corresponds to the highest spike in both trends.
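A sketch of the keyword time series, assuming hypothetical `date` and `quotation` columns; the Google Trends series would be exported separately and compared against these counts:

```python
import pandas as pd

# Hypothetical speaker-level quotes with dates.
trump_quotes = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-04", "2019-08-05", "2019-09-01"]),
    "quotation": [
        "The shooting in El Paso was a tragedy.",
        "We must respond to this shooting.",
        "The economy is doing great.",
    ],
})

# Weekly counts of quotes containing the keyword.
keyword = "shooting"
contains_kw = trump_quotes["quotation"].str.contains(keyword, case=False)
weekly_counts = (
    trump_quotes.loc[contains_kw]
                .set_index("date")
                .resample("W")
                .size()
)
print(weekly_counts)
```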
While examining approaches for answering our third research question, we discovered an important aspect of the dataset. Quotes may be referenced in news websites at a different time than they were originally spoken. Furthermore, news quotes may not be a direct representation of what politicians actually talk about: by definition, quotes are cherry-picked by authors who might be biased. For example, some websites are owned by politicians, and controversial subjects are over-emphasised in the media.
- 15.11.21 Integration of datasets for all years into current analysis.
- 19.11.21 Re-do clustering based on topics of interests.
- 22.11.21 Pause project work.
- 26.11.21 Homework 2 deadline
- 28.11.21 Perform final analyses.
- 08.12.21 Begin developing a rough draft of the datastory.
- 11.12.21 Complete all code implementations and visualisations relevant to analysis.
- 14.12.21 Complete datastory.
- 17.12.21 Milestone 3 deadline
Member | Tasks |
---|---|
@arinaraileanu | Election prediction, web development, come up with meaningful visualizations, continue exploring the dataset, develop the final text for the data story |
@nitu-catalin1998 | Graph analysis over topics and parties, common words analysis, develop the web interface, analyze news website bias, develop the final text for the data story |
@MihaiDavid05 | Graph analysis over topics and parties, topics analysis over time, define topics of interest, tune clustering, develop the final text for the data story |
@umerhasan17 | Topics analysis per year, develop the web interface, integrate datasets of all years, develop the final text for the data story |