Commons same as link : the case of french #719

Open
datalogism opened this issue Nov 8, 2021 · 12 comments

Comments

@datalogism
Member

Issue validity

As explained here: https://forum.dbpedia.org/t/commons-ressources-extractor-problem/1485
I ran into an issue concerning the Commons links extracted from French Wikipedia pages.

Error Description

Please state the nature of your technical emergency:
The data artifact is empty on the Databus: https://databus.dbpedia.org/dbpedia/generic/commons-sameas-links/2021.09.01/commons-sameas-links_lang=fr.ttl.bz2

Pinpointing the source of the error

What I have done so far:

  • I created a SHACL test (which I have not pushed yet)
  • I fixed it locally by overriding the related extractor

My questions:

  • The name of the {{Autres projets}} template is language-specific (it is called {{Sister projects}} in en). Is there a way to use a dictionary to make the fix multilingual? Or do I have to add a conditional branch to the extractor (checking the language and the matching template name)?
  • What is the best way to exploit all the "Sister projects" items (Wiktionary, Wikinews, ...)?

Thank you in advance!

@Vehnem
Collaborator

Vehnem commented Nov 29, 2021

  1. How can I write a SPARQL query to find out whether the French chapter is the only one affected by this problem?
    Fixing that problem seems to be sensitive:

Concerning https://databus.dbpedia.org/dbpedia/generic/commons-sameas-links/ : these links are only extracted for a handful of languages, and all files seem quite small. So it is possible that French is not the only affected language.

  1. I have the feeling this practice prevents the Commons mapping from being triggered. Is that right?

Yes. Maybe it is enough to adapt the mappings?

  1. The {{Autres projets}} template aggregates more than just the “commons” links and could also be used to extract Wiktionary links, Wikiquote links… Should I just fix it by adding a language-specific test case to the given extractor file? Or should I consider developing a new extractor for the Commons links coming from this template? Option 2 seems better because of the other data I could grab that way, but I would like to get your expertise on that question!

First, you should write a minidump test (you already did; maybe you can open a PR).

  1. Once I have fixed it, what is the best way to push it?

Create a pull request against the dev branch.

@Vehnem
Collaborator

Vehnem commented Nov 29, 2021

The name of the {{Autres projets}} template is language-specific (it is called {{Sister projects}} in en). Is there a way to use a dictionary to make the fix multilingual? Or do I have to add a conditional branch to the extractor (checking the language and the matching template name)?

IIRC there is no dictionary yet, but one should not be necessary if the extractor uses the mappings correctly.
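
If a dictionary were needed after all, it could be as small as the sketch below. This is only an illustration of the "dictionary instead of a conditional branch" idea: the table `SISTER_TEMPLATE_NAMES` and the function `is_sister_template` are hypothetical names, not part of the extraction framework, and the table only lists the template names mentioned in this thread.

```python
from typing import Dict, List

# Hypothetical table: wiki language code -> names of the sister-project template.
# Only the names mentioned in this thread are listed.
SISTER_TEMPLATE_NAMES: Dict[str, List[str]] = {
    "fr": ["Autres projets"],
    "en": ["Sister projects", "Sister project links"],
}


def normalize_title(name: str) -> str:
    """Normalize a template name: underscores act as spaces and the
    first letter is case-insensitive on Wikipedia."""
    cleaned = name.strip().replace("_", " ")
    return cleaned[:1].upper() + cleaned[1:]


def is_sister_template(language: str, template_name: str) -> bool:
    """Return True if the template is a known sister-project template
    for the given language."""
    candidates = SISTER_TEMPLATE_NAMES.get(language, [])
    return normalize_title(template_name) in candidates
```

With such a table, supporting another language means adding one entry rather than another branch in the extractor.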

What is the best way to exploit all the "Sister projects" items (Wiktionary, Wikinews, ...)?

I think downloading the dump of the specific wiki and running a grep on it is the easiest option:

https://dumps.wikimedia.org/
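
As a rough alternative to a plain grep, the sketch below scans a downloaded dump page by page and collects the {{Autres projets}} calls, including those that span several lines. It assumes a `pages-articles` style `.xml.bz2` dump in which every page ends with a `</page>` line, and it ignores templates nested inside the parameters. The function names are made up for this example.

```python
import bz2
import re
from typing import Iterator, List

# Matches a non-nested {{Autres projets ...}} call, possibly spanning lines.
# Assumption: the parameters do not contain nested templates.
TEMPLATE_RE = re.compile(
    r"\{\{\s*Autres projets\s*(\|[^{}]*)?\}\}", re.IGNORECASE
)


def find_sister_templates(wikitext: str) -> List[str]:
    """Return every {{Autres projets}} call found in a chunk of wikitext."""
    return [match.group(0) for match in TEMPLATE_RE.finditer(wikitext)]


def scan_dump(path: str) -> Iterator[str]:
    """Stream a .xml.bz2 dump and yield the template calls page by page."""
    buffer: List[str] = []
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            buffer.append(line)
            if "</page>" in line:
                # One full page is buffered: search it, then free the memory.
                yield from find_sister_templates("".join(buffer))
                buffer.clear()
```

Calling `scan_dump` on the local path of the downloaded French dump would then yield one string per template call, which can be counted or parsed further.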

@datalogism
Member Author

Hello @Vehnem, and thank you so much for your answers.
It took me a while to reply because I am still wondering about the infobox extraction process.


In the French chapter, only a small subset of the declared mappings (http://mappings.dbpedia.org/index.php/Mapping_fr) are named with the "Infobox" pattern. In fact, some of them are for boxes that are not necessarily an "Infobox", because they can be placed at the end of the Wikipedia article, like the following template: https://en.wikipedia.org/wiki/Template:Authority_control

However, a few examples such as the "ChimieBox" (ChemBox in English) or the "Taxobox" are a kind of infobox, even if they don't have "Infobox" in their names.

  1. Is the "Infobox" pattern required (tested, for example, by a regexp) to trigger an extraction through the DIEF? Are the template mappings declared in the XML file also used to extract property data, as in the Authority_control example?

I investigated this question by running the minidump process on some example Wikipedia pages that use these templates (https://github.com/datalogism/DBpediaExperiments/blob/main/MappingInfoBoxAnalysis.ipynb)

According to the up-to-date mapping, the ChimieBox is supposed to give us some data: https://github.com/dbpedia/extraction-framework/blob/master/mappings/Mapping_fr.xml. But:

  2. Is there something more in the minidump processing that explains these differences in extraction: "infobox-like templates that do not follow the infobox naming convention" vs. "templates declared as infobox templates"?

I also noticed a case: https://en.wikipedia.org/wiki/Football_at_the_2012_Summer_Olympics_%E2%80%93_Men's_tournament_%E2%80%93_Final that uses two templates: "Infobox football match" as the infobox, and "Football box", an included property template describing the football event in more detail.

-> Only the data from the "Infobox football match" template is returned.

  3. Why didn't the second template (the property template) return any data in the minidump process?

  4. Rereading my message before sending it, I have the following intuition: it may be because of the config file that controls which extractors are used for the minidump process: https://github.com/dbpedia/extraction-framework/blob/master/dump/src/test/resources/extraction-configs/generic-spark.extraction.minidump.properties
It seems to use the same extractors as the "global" extraction config (https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.spark.properties), but is that really the case? Is there an extractor that extracts only the mapped "property templates"?

Now, coming back to my original question:
Is it preferable to develop a dedicated properties extractor for the {{Sister projects}} template properties? Or is it better to include it in a mapping? And if I do, will it be extracted?

I am sorry for all these questions, but as a newbie I must be sure how the process handles this kind of data before I can help the community in the best way.

@JJ-Author
Contributor

JJ-Author commented Dec 13, 2021

@datalogism can we take a step back? Could you say what you would actually like to extract, i.e. what kind of triples you want? If it is, e.g., only about commons-sameAs links or "authority template links", I wonder whether it would be best to rely on the Wikidata extraction instead.

https://databus.dbpedia.org/dbpedia/wikidata/sameas-all-wikis/
https://databus.dbpedia.org/dbpedia/wikidata/sameas-external/

@datalogism
Member Author

datalogism commented Dec 13, 2021

@JJ-Author, originally I wanted to get the commons-sameAs links, and you are right: for that initial goal, your proposed fix is sufficient.
But I also understood that I could get all the links carried by the "Sister projects" template; I am thinking about the Wiktionary links, for example.

This led me to the questions about infobox extraction via the mappings that I raised above.
At first sight, two ways seem possible for extracting these links:

  • via a dedicated extractor (the initial question asked in this thread)
  • or via a mapping (if property templates can be extracted this way, which is the focus of my second message)

@JJ-Author
Contributor

JJ-Author commented Dec 13, 2021

As I understand it, the idea of the mappings extraction is to map infobox parameters to the DBpedia ontology. The idea is that these infoboxes represent more or less standardized information for a subset of entities of the same type. You are right that infoboxes are only templates, so in theory it could work to define an "infobox" mapping for sister projects. But the template seems more like a generic template that is valid for all types of Wikipedia articles (hence I see it more in the generic extraction).
-->
So my personal intuition is that a dedicated extractor seems the right choice, because these sister projects are not directly tied to the entity but to the article page itself. If you extract the triples via a mapping, they would end up in the mappingbased-objects artifact, and I personally think they do not belong there.

With regard to the detailed questions about minidump, @Vehnem will write to you later.

@datalogism
Member Author

Thank you @JJ-Author! Your arguments go in the same direction as my first understanding of infoboxes. I wanted to be sure of the design philosophy because the analysis of the mapping files showed me that property templates were mapped too, as in the cited Authority control example: http://mappings.dbpedia.org/index.php/Mapping_en:Authority_control

Question: could this kind of out-of-philosophy mapping affect or alter the type assigned to an entity?

Looking forward to @Vehnem's feedback!

@jlareck
Collaborator

jlareck commented Dec 14, 2021

Now, coming back to my original question:
Is it preferable to develop a dedicated properties extractor for the {{Sister projects}} template properties? Or is it better to include it in a mapping? And if I do, will it be extracted?

@datalogism I think we should first look at the existing extractors; maybe some of them have similar logic that we can reuse to achieve what you want. But before checking the extractors, we also need a clear example of what the input and the output should be. So here are the next steps that will help us solve this issue:

  1. Send a link to some page that uses the {{Sister projects}} or {{Autres projets}} template.
  2. For that page, please send the triples you expect to be extracted from the {{Sister projects}} (or {{Autres projets}}) template.
    Below I will show you an example:

For example, we have the page https://en.wikipedia.org/wiki/Borysthenia_goldfussiana . It contains this infobox:

{{Taxobox
| name = ''Borysthenia goldfussiana''
| image =
| image_caption =
| status =
| regnum = [[Animal]]ia
| phylum = [[Mollusca]]
...

And InfoboxExtractor (I guess InfoboxExtractor produced them, but maybe another extractor could also produce those triples) produces triples like these (let's also assume that they are the expected triples):

<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/name> "Borysthenia goldfussiana"@en .
<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/regnum> "Animalia"@en .
<http://dbpedia.org/resource/Borysthenia_goldfussiana> <http://dbpedia.org/property/phylum> <http://dbpedia.org/resource/Mollusca> .

So, in a similar way, please describe what data must be produced from some concrete page. It would be very helpful to know what the subject, predicate, and object should be.
Thank you

@jlareck jlareck removed the question label Dec 14, 2021
@datalogism
Member Author

Hello @jlareck !

Concerning the Sister projects template question, almost every French article has one.
We miss the "commons sameAs" triples because the current extractor, as developed, relies on the {{commons}} template, which is never used on its own in French Wikipedia.

Let's take this example:
https://fr.wikipedia.org/wiki/Berlin contains the following template at the end of the article:

{{Autres projets
| commons=Category:Berlin
| wiktionary=Berlin
| wikinews=Catégorie:Berlin
| wikivoyage=Berlin
}}

In terms of triples, we could imagine something like this using the owl:sameAs property, but we could also imagine creating a dedicated property for it in the ontology (following the example of the WikiPageInterLanguageLink property, we could have a property called WiktionaryLink):

<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://commons.dbpedia.org/resource/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wiktionary.org/wiki/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikinews.org/wiki/Cat%C3%A9gorie:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikivoyage.org/wiki/Berlin> .
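
To make the mapping from template parameters to triples concrete, here is a minimal Python sketch. It is not how the extraction framework works internally: `PROJECT_BASES`, `parse_params` and `to_ntriples` are hypothetical names, the base IRIs are simply copied from the example triples above, and the parser ignores nested templates. Note that the sketch keeps the `Category:` prefix of the Commons value, so its Commons IRI differs from the first triple above; whether that prefix should be stripped is left open.

```python
from typing import Dict, List
from urllib.parse import quote

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical lookup table: template parameter -> base IRI of the target wiki.
PROJECT_BASES: Dict[str, str] = {
    "commons": "http://commons.dbpedia.org/resource/",
    "wiktionary": "https://fr.wiktionary.org/wiki/",
    "wikinews": "https://fr.wikinews.org/wiki/",
    "wikivoyage": "https://fr.wikivoyage.org/wiki/",
}


def parse_params(template: str) -> Dict[str, str]:
    """Split a flat {{Name|key=value|...}} call into its named parameters."""
    body = template.strip()[2:-2]  # drop the surrounding {{ and }}
    params: Dict[str, str] = {}
    for part in body.split("|")[1:]:  # the first chunk is the template name
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return params


def to_ntriples(subject: str, template: str) -> List[str]:
    """Build one owl:sameAs N-Triples line per known sister project."""
    lines: List[str] = []
    for key, value in parse_params(template).items():
        base = PROJECT_BASES.get(key)
        if base is None or not value:
            continue  # unknown project or empty parameter
        # Page titles use underscores; non-ASCII characters are percent-encoded.
        title = quote(value.replace(" ", "_"), safe=":/")
        lines.append(f"<{subject}> <{OWL_SAME_AS}> <{base}{title}> .")
    return lines
```

Calling `to_ntriples("http://fr.dbpedia.org/resource/Berlin", ...)` with the Berlin template text yields four lines, one per parameter, in the order the parameters appear in the template.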

For the moment, only an extractor for Commons exists: https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/CommonsResourceExtractor.scala

If we plan to integrate the Wiktionary links and those of other wikis, we have to think about how to shape it:

  • creating one global extractor able to deal with the {{Sister projects}} template
  • creating a separate extractor for each of the other wiki portals.

@jlareck
Collaborator

jlareck commented Dec 19, 2021

@datalogism okay, so for {{Autres projets}} we can create a new extractor, and InfoboxExtractor can be the starting point for it. It already extracts the data from this template and produces, for example, triples like these:

<http://fr.dbpedia.org/resource/Antoine_Meillet> <http://fr.dbpedia.org/property/wikisource> "Antoine Meillet"@fr .
<http://fr.dbpedia.org/resource/Antoine_Meillet> <http://fr.dbpedia.org/property/commons> "Category:Antoine Meillet"@fr .

from

{{Autres projets
|wikisource = Antoine Meillet
|commons = Category:Antoine Meillet
}}

So we can take InfoboxExtractor as a base, modify some parts, and produce the necessary triples from this template.

This is an example of the {{Sister project links}} template:

{{Sister project links|Angela Merkel|wikt=Merkozy|s=Author:Angela Merkel|display=Angela Merkel}}

As I see it, {{Sister project links}} has a different structure, and we need to think more about how to handle it. Here I guess we need mappings from parameters like s and wikt to some other properties.

And it looks like we need to skip some parts of this template (e.g. |Angela Merkel| and display=Angela Merkel) during the extraction, am I right?
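
A small Python sketch of that idea, under the assumptions stated in this comment: short parameters such as `s` and `wikt` are mapped to project names, while the positional argument and `display` are skipped. `KEY_ALIASES`, `SKIPPED_KEYS` and `parse_sister_project_links` are hypothetical names, and the alias table only covers the two parameters shown in the example, so it would need to be extended from the template documentation.

```python
from typing import Dict

# Hypothetical alias table, limited to the parameters seen in the example.
KEY_ALIASES: Dict[str, str] = {
    "wikt": "wiktionary",
    "s": "wikisource",
}

# Named parameters that carry no link and are ignored.
SKIPPED_KEYS = {"display"}


def parse_sister_project_links(template: str) -> Dict[str, str]:
    """Return {project name: page title} for a flat {{Sister project links}} call."""
    body = template.strip()[2:-2]  # drop the surrounding {{ and }}
    links: Dict[str, str] = {}
    for part in body.split("|")[1:]:  # the first chunk is the template name
        if "=" not in part:
            continue  # positional argument such as |Angela Merkel|
        key, value = (piece.strip() for piece in part.split("=", 1))
        if key in SKIPPED_KEYS:
            continue
        project = KEY_ALIASES.get(key)
        if project and value:
            links[project] = value
    return links
```

For the Angela Merkel example above, this keeps only the `wikt` and `s` values and drops the rest.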

@jlareck
Collaborator

jlareck commented Dec 22, 2021

Hi, @datalogism, I have implemented a draft extractor for {{Autres projets}}. You can have a look at it, maybe something can be helpful for you: https://github.com/dbpedia/extraction-framework/blob/a9ed5f0396c82854c8e1663d87571a0935c444ab/core/src/main/scala/org/dbpedia/extraction/mappings/AutresProjectExtractor.scala . It produces triples like:

<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <http://fr.commons.dbpedia.org/resource/Category:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wiktionary.org/wiki/Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikinews.org/wiki/Catégorie:Berlin> .
<http://fr.dbpedia.org/resource/Berlin> <http://www.w3.org/2002/07/owl#sameAs> <https://fr.wikivoyage.org/wiki/Berlin> .

You can run the minidump tests and see those triples in the infobox-properties dataset (I reused the dataset configuration from InfoboxExtractor for this draft implementation of the AutresProjetExtractor).

@datalogism
Member Author

As I see it, {{Sister project links}} has a different structure, and we need to think more about how to handle it. Here I guess we need mappings from parameters like s and wikt to some other properties.

I hadn't thought about this template, good catch. It is based on a Lua script defined here: https://en.wikipedia.org/wiki/Module:Sister_project_links
Based on this script, we could easily adapt the logic for an extractor.

For me, this script highlights two kinds of links: those we can easily find via a search (generally via the name of the article), and those that are not obvious. Merkozy is a good example here, and it is this second kind that gives the extraction real added value.

I have implemented a draft extractor for {{Autres projets}}. You can have a look at it, maybe something can be helpful for you

Thank you again, @JJ-Author, @Vehnem, @jlareck, for your help and support!
I will test this brand-new extractor in the next few days and give you my feedback!
