
Maintaining and Updating Dictionaries

Ambreen H edited this page Jun 28, 2020 · 4 revisions

SPARQL query XML doesn't match with amidict XML

Tester: Ambreen H: When I download dictionaries from SPARQL, the XML document has each column as a separate element rather than as an attribute within the element tag. Will this format work for ami search, or is there a way to convert it to the format that ami search requires? I used SPARQL to get all relevant information about countries, including abbreviations, synonyms, URLs and country codes, which is not available in the country dictionary in ami. Downloaded dictionary for reference: https://github.com/petermr/openVirus/blob/master/dictionaries/test/country_wikidata.xml.xml

For example (excerpt):

<?xml version='1.0' encoding='UTF-8'?>
<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
	<head>
		<variable name='wikidata'/>
		<variable name='wikidataLabel'/>
		<variable name='wikipedia'/>
		<variable name='wikidataAltLabel'/>
		<variable name='synonyms'/>
	</head>
	<results>
		<result>
			<binding name='wikidata'>
				<uri>http://www.wikidata.org/entity/Q16</uri>
			</binding>
			<binding name='synonyms'>
				<literal>🇨🇦</literal>
			</binding>
			<binding name='wikipedia'>
				<uri>https://en.wikipedia.org/wiki/Canada</uri>
			</binding>
			<binding name='wikidataLabel'>
				<literal xml:lang='en'>Canada</literal>
			</binding>
			<binding name='wikidataAltLabel'>
				<literal xml:lang='en'>CA, ca, CDN, can, CAN, British North America, 🇨🇦, Dominion of Canada</literal>
			</binding>
		</result>

PMR: You can use the existing dictionary for ami search at present, but the dictionary itself has many shortcomings and needs extensive editing. See https://github.com/petermr/ami3/blob/master/src/main/resources/org/contentmine/ami/plugins/dictionary/country.xml

Because of that, and because almost all the content for country will be in Wikidata, SPARQL will give a better dictionary. I will write an amidict tool to convert the SPARQL output to amidict format.
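Until such a tool exists, the conversion can be sketched in a few lines of Python. This is only an illustration, not the amidict implementation: the output element and attribute names (`dictionary`, `entry`, `term`, `name`, `wikidataID`, `wikipediaURL`, `synonym`) are assumptions about the target format and should be checked against a real amidict dictionary. The input column names (`wikidata`, `wikidataLabel`, `wikipedia`, `wikidataAltLabel`) are those of the SPARQL result shown above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the W3C SPARQL Query Results XML Format.
SPARQL_NS = {"s": "http://www.w3.org/2005/sparql-results#"}

# Shortened sample in the same shape as the downloaded file.
SAMPLE = """<sparql xmlns='http://www.w3.org/2005/sparql-results#'>
  <results>
    <result>
      <binding name='wikidata'><uri>http://www.wikidata.org/entity/Q16</uri></binding>
      <binding name='wikipedia'><uri>https://en.wikipedia.org/wiki/Canada</uri></binding>
      <binding name='wikidataLabel'><literal xml:lang='en'>Canada</literal></binding>
      <binding name='wikidataAltLabel'><literal xml:lang='en'>CA, CDN, Dominion of Canada</literal></binding>
    </result>
  </results>
</sparql>"""


def sparql_to_dictionary(sparql_xml, title="country"):
    """Build one <entry> per SPARQL <result>, moving columns into attributes.

    The output attribute names are assumed, not taken from the amidict schema.
    """
    root = ET.fromstring(sparql_xml)
    dictionary = ET.Element("dictionary", {"title": title})
    for result in root.findall("s:results/s:result", SPARQL_NS):
        # Collect the text of each binding (<uri> or <literal>) by column name.
        row = {}
        for binding in result.findall("s:binding", SPARQL_NS):
            value = binding.find("*")
            if value is not None and value.text:
                row[binding.get("name")] = value.text.strip()
        label = row.get("wikidataLabel")
        if not label:
            continue  # no label means no usable search term
        entry = ET.SubElement(dictionary, "entry", {"term": label, "name": label})
        if "wikidata" in row:
            # Keep only the trailing identifier, e.g. "Q16".
            entry.set("wikidataID", row["wikidata"].rsplit("/", 1)[-1])
        if "wikipedia" in row:
            entry.set("wikipediaURL", row["wikipedia"])
        # Alternative labels arrive as one comma-separated string.
        for alt in row.get("wikidataAltLabel", "").split(","):
            if alt.strip():
                ET.SubElement(entry, "synonym").text = alt.strip()
    return dictionary


print(ET.tostring(sparql_to_dictionary(SAMPLE), encoding="unicode"))
```

Results without a `wikidataLabel` are skipped here because they would give an entry with no search term; a real tool might report them instead.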

The purposes of the dictionaries are:

  1. to collect together concepts we are interested in under a single label (e.g. country)
  2. to provide an identifier for each concept
  3. to provide search terms for each concept to locate it in the documents we search
  4. to link the concept to the world's knowledge graph.
  5. to help human readers understand the concepts.
  6. to provide a record of provenance and maintenance

This may be understood by taking the country dictionary as an example:

All the words that could appear in any research paper should be available in our country dictionary or, better, "all the words describing countries where viral epidemics have been reported/discussed". That can be hard ("Himalayan", "North Atlantic", "Sub-Saharan", etc.), but generally academic papers will mention one or more countries specifically. @Emanuel Faria has done this for plants ("where do essential oils come from?"). For that, a country ("India") is too broad: we might want "Goa" or "Rajasthan". We may have to be more specific, e.g. "Wuhan" rather than "China". But for the moment let's work with countries.

It must also be ensured that the country names in the dictionary are really recognized countries (for instance, not ancient empires). In my opinion, the dictionary must also contain the following:

  1. All the synonyms of the country: Yes. "England", "Scotland", "Britain", "United Kingdom" are all widely used.
  2. All the common abbreviations: Yes. UK, GB, NI, for example. Abbreviations often cause ambiguity.
  3. Maybe even translations in other important world languages: Absolutely. If we are going to explore Hindi we will need a term.hi attribute. Wikidata has these if they are the titles of Wikipedia pages.
  4. The current dictionary has empty entity-tags for Wikipedia as well as wikidata which must also be present for redirection to the source pages: Yes. The tags were autogenerated to show they should be filled by hand.
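Because abbreviations often cause ambiguity, it may help to separate the alternative labels returned by SPARQL into terms that are probably safe to search for and terms that need a human decision. The sketch below is one possible heuristic, not part of amidict: the function name `classify_alt_labels`, the two-character rule and the `risky_words` list are all hypothetical choices made for illustration.

```python
def classify_alt_labels(alt_label, risky_words=frozenset({"can"})):
    """Split a comma-separated alt-label string into (keep, review) lists.

    A term goes to `review` when it is a very short ASCII string or when its
    lowercase form is in `risky_words` (an assumed, hand-maintained list of
    strings that collide with ordinary English words).
    """
    seen = set()
    keep, review = [], []
    for raw in alt_label.split(","):
        term = raw.strip()
        if not term or term in seen:
            continue  # skip empty pieces and exact duplicates
        seen.add(term)
        too_short = term.isascii() and len(term) <= 2
        if too_short or term.lower() in risky_words:
            review.append(term)
        else:
            keep.append(term)
    return keep, review


# Alt labels for Canada, as returned in the SPARQL result above.
keep, review = classify_alt_labels(
    "CA, ca, CDN, can, CAN, British North America, 🇨🇦, Dominion of Canada"
)
print("keep:  ", keep)
print("review:", review)
```

Terms in the review list are not discarded; they are simply set aside so that someone can decide whether, for example, "can" should ever be used as a search term for Canada.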

Should I update the existing country dictionary manually with all missing values?

The amidict software is, in principle, able to find Wikidata and Wikipedia links. But these are often ambiguous. In cases like country I expect the correct link will be the leading one found. Manual checking is always required. This is an excellent thing for incoming INYAS to help with.
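To organise the manual checking, a small script could list the entries whose links are still empty. The following is a sketch under assumptions: the attribute names `wikidataID` and `wikipediaURL` and the sample dictionary are hypothetical and should be replaced by whatever the real dictionary file uses.

```python
import xml.etree.ElementTree as ET

# Assumed names of the link attributes; adjust to match the real dictionary.
LINK_ATTRS = ("wikidataID", "wikipediaURL")

# Hypothetical sample with one complete and one incomplete entry.
SAMPLE_DICTIONARY = """<dictionary title='country'>
  <entry term='Canada' wikidataID='Q16' wikipediaURL='https://en.wikipedia.org/wiki/Canada'/>
  <entry term='India' wikidataID='' />
</dictionary>"""


def entries_needing_review(dictionary_xml):
    """Return (term, missing_attributes) for entries with empty or absent links."""
    root = ET.fromstring(dictionary_xml)
    todo = []
    for entry in root.iter("entry"):
        missing = [a for a in LINK_ATTRS if not (entry.get(a) or "").strip()]
        if missing:
            todo.append((entry.get("term"), missing))
    return todo


for term, missing in entries_needing_review(SAMPLE_DICTIONARY):
    print(term, "is missing", ", ".join(missing))
```

This only finds links that are absent; links that are present but point to the wrong page still need a human reader.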
