

Background Linking

Definition

After some thought and observation, I define a background article as an expansion of some part of the original query article. This is the essential basis of all my experiments. It might not be the right or complete definition, but I think it is reasonable. If you find it incomplete or wrong in the future, you can refine it and improve my current method accordingly.

Method and Result Analysis

Following this definition, my initial attempt was to use every paragraph of the query article to retrieve articles. I tried two ways: using all terms or using all entities of the paragraph. I keep only the first two results for each paragraph, because the number of background articles for a single paragraph should be small: a news organization is not likely to report the same thing twice.
The two methods gave similar results, and I think using all terms is slightly better. Since a paragraph is long and can be noisy, I wanted to refine it further, so I tried extracting keywords from each paragraph using the probability P(d|w), the probability of generating the paragraph d given the word w. I arbitrarily chose to use the top 5 keywords. Initially I also wanted to weight each keyword by this probability, but some words are so rare that the weights become extremely skewed (e.g., one word gets weight 1 and all the others get 0), so I gave every keyword the same weight instead. As before, I keep the top two results for each paragraph and use 5 as the score threshold; 5 is chosen arbitrarily, and a more sophisticated choice is possible. The final results appear to improve.
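For reference, here is a minimal sketch of how the P(d|w) ranking could be computed, assuming maximum-likelihood estimates P(w|d) = tf(w,d)/|d| and P(w) = cf(w)/|C|, and a uniform document prior so that P(d|w) ∝ P(w|d)/P(w) by Bayes' rule. The function and variable names are mine for illustration, not part of any existing code:

```python
from collections import Counter

def extract_keywords(paragraph_terms, collection_tf, collection_len, k=5):
    """Rank a paragraph's terms by P(d|w) and keep the top k.

    Under a uniform document prior, P(d|w) ∝ P(w|d) / P(w).

    paragraph_terms: list of tokens in the paragraph
    collection_tf:   Counter of term frequencies over the whole collection
    collection_len:  total number of tokens in the collection
    """
    tf = Counter(paragraph_terms)
    n = len(paragraph_terms)
    scores = {}
    for w, f in tf.items():
        p_w_given_d = f / n                      # maximum-likelihood P(w|d)
        p_w = collection_tf[w] / collection_len  # collection-wide P(w)
        if p_w > 0:
            scores[w] = p_w_given_d / p_w        # ∝ P(d|w)
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]
```

The selected keywords would then be issued as an unweighted query, keeping the top two results per paragraph that score above the threshold of 5.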

However, since my method is pretty basic, I think many parts of it can be improved:

  1. How many keywords should be used?
  2. Could we use a better weighting scheme for the keywords?
  3. How should the per-paragraph results be merged into a single ranking for the article? (A simple fusion baseline is sketched after this list.)
  4. Can we find a better method for keyword extraction?
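Item 3 is still open; as a starting point, here is a hedged sketch of two standard fusion baselines (CombMAX and CombSUM) for turning per-paragraph result lists into one article-level ranking. The data layout is an assumption:

```python
from collections import defaultdict

def merge_paragraph_results(per_paragraph_results, fusion="max"):
    """Fuse per-paragraph rankings into one article-level ranking.

    per_paragraph_results: list of {doc_id: score} dicts, one per paragraph.
    fusion: "max" keeps a document's best paragraph score (CombMAX);
            "sum" accumulates scores across paragraphs (CombSUM).
    """
    merged = defaultdict(float)
    for results in per_paragraph_results:
        for doc_id, score in results.items():
            if fusion == "max":
                merged[doc_id] = max(merged[doc_id], score)
            else:
                merged[doc_id] += score
    return sorted(merged.items(), key=lambda x: -x[1])
```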

I have some thoughts about the last point. After examining the results of using keywords as queries, I found some problems. For example, in this example query article, the first paragraph is

LONDON — In April, British Prime Minister Theresa May called for a snap general election to be held on June 8

When using all the words of the paragraph, I could find this background article; with the keyword query, I missed it. The keyword query is prime london theresa british minister. It is not surprising that these 5 words were extracted as keywords, since they co-occur very often, but they do not include all the keywords we need: the word election is missing. Dr. Fang therefore suggested generating the query from the whole article, which in this case should be about the British election. With that query, the background article can be retrieved.
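If the keyword extractor sketched earlier is reused, switching from paragraph-level to article-level queries is just a change of input span; a hypothetical call (tokenize and the collection counts are assumed to exist):

```python
# Hypothetical reuse of the extract_keywords sketch above on the full
# article text rather than a single paragraph; tokenize is assumed.
article_query = extract_keywords(tokenize(article_text),
                                 collection_tf, collection_len, k=5)
```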

Another problem with the extracted keywords is that they favor rare terms. If a paragraph contains multiple rare terms, then precisely because they are rare, their co-occurrence is highly distinctive; they boost P(d|w) for each other, which leads to all of them being chosen as keywords. One solution could be to set a document-frequency cutoff: if a term's document frequency is below the cutoff, it is not considered.
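A minimal sketch of that cutoff, assuming document frequencies are available as a dict; the value of min_df is arbitrary, like the other thresholds in these notes:

```python
def filter_rare_terms(candidates, doc_freq, min_df=5):
    """Drop candidate keywords whose document frequency falls below
    min_df, so that mutually-boosting rare terms cannot dominate P(d|w).

    candidates: iterable of candidate keyword terms
    doc_freq:   dict mapping term -> number of documents containing it
    """
    return [w for w in candidates if doc_freq.get(w, 0) >= min_df]
```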

I have an idea that may solve both problems: match a paragraph against potential background articles along two dimensions, the entities and the words around the entities. News reports often focus on five aspects: who, when, what, where, and why. Entities can match the who and the where, while the words around the entities can match the what. We do not need to match the when, since if the other aspects match, the candidate should be a background article; and we do not need to match the why, as it almost certainly does not affect whether an article is background or not. This two-dimension matching may address both problems above: in the first example, the selected keywords are all parts of entities, whereas the missing word election is a word surrounding the entities; in the second, the rare but frequently co-occurring terms are usually entities, such as the first and last name of a person.
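To make the idea concrete, here is a sketch using spaCy (my choice of NER tool, not something fixed by the task) that splits a paragraph into the two dimensions: the entity strings themselves and the non-entity words within a small window around each entity. For the example paragraph above, the entity view would contain names like Theresa May and London, while with a suitable window the context view should pick up election. The model name and window size are assumptions:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_and_context_views(paragraph, window=5):
    """Split a paragraph into the two matching dimensions:
    the entities themselves ("who"/"where") and the non-entity
    words within `window` tokens of each entity ("what").
    """
    doc = nlp(paragraph)
    entities = [ent.text for ent in doc.ents]
    context = set()
    for ent in doc.ents:
        lo = max(ent.start - window, 0)
        hi = min(ent.end + window, len(doc))
        for tok in doc[lo:hi]:
            # keep only content words that are not part of any entity
            if tok.ent_type_ == "" and tok.is_alpha and not tok.is_stop:
                context.add(tok.lemma_.lower())
    return entities, sorted(context)
```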

As I mentioned in our last discussion, a graduate student of our lab has worked on matching entities together with their language models. You can read his paper, and I will also forward the email in which I asked him for his code so that you can find it. I think his code could be a good starting point if we want to do the two-dimension matching.

Other Thoughts

We could use exact matches of images (e.g., matching image URLs) to find background articles, since the images in an article can be about background information rather than about the article itself.
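A minimal sketch of this exact-URL match, assuming the articles are available as HTML; a real HTML parser would be more robust than a regex, and a non-empty overlap is only a signal, not a guarantee:

```python
import re

IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

def shared_image_urls(query_html, candidate_html):
    """Return image URLs that appear in both articles; any overlap
    suggests the candidate may provide background for the query."""
    return set(IMG_SRC.findall(query_html)) & set(IMG_SRC.findall(candidate_html))
```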

Entity Ranking

Since I have been focusing on background linking, I have not spent much time on entity ranking. One thing we did realize is that there are different types of entities, and they may require different techniques to recognize them as background entities. For example, the British election query article has "Tory", the nickname of a British political party, as a background entity, while this example query article has San Antonio, the location where the incident occurred, as a background entity. I feel we need different methods for these two kinds of entities, and I especially think location entities need special treatment.
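One possible starting point, again using spaCy as in the earlier sketch (the choice of labels is an assumption), is simply to route entities by NER type so each group can get its own scoring method:

```python
def split_entities_by_type(doc):
    """Separate location entities from the rest so each group can be
    scored with its own method. `doc` is a spaCy Doc, as in the earlier
    sketch; GPE/LOC/FAC are spaCy's location-like labels."""
    locations, others = [], []
    for ent in doc.ents:
        (locations if ent.label_ in {"GPE", "LOC", "FAC"} else others).append(ent.text)
    return locations, others
```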

Although links from the entities in the query articles to a Wikipedia dump will be provided, I do not think they need to be used: how an entity is described on its Wikipedia page can be dramatically different from how it is used in a news article.
