Automate generation of @en-tagged versions of annotations #840

Closed
rjyounes opened this issue May 11, 2023 · 5 comments
Comments

@rjyounes
Collaborator

rjyounes commented May 11, 2023

See issue #685.

The @en-tagged annotations will go into a separate file, so the main files keep the canonical, untagged versions.
The tagged file will be generated as part of the pre-commit hook so that it is up to date for development purposes.

Assigning to @Jamie-SA, to be delegated if desired.

@rjyounes
Collaborator Author

rjyounes commented May 11, 2023

TBD: File naming conventions, file and directory structure

  • One directory for each language or a shared directory for all languages?
  • One file per language, or mirroring the canonical files with 5 separate files? If the latter, there should be a directory per language to avoid clutter.
  • The release package generates multiple serializations of each file, so having one directory per sub-language seems cleaner and more manageable.
  • Adding status "under review" until these questions are answered.

Also TBD whether to use 2- or 4-letter tags (language only, e.g. `en`, or language plus region, e.g. `en-US`). See discussion thread on issue #685.

@rjyounes rjyounes added the status: under review In triage label May 11, 2023
@rjyounes rjyounes added impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range) and removed impact: major Non-backward compatible (changes inferences; e.g., adding a restriction, domain, range) labels May 11, 2023
@rjyounes
Collaborator Author

rjyounes commented May 16, 2023

Not sure if this helps, but ontology-toolkit provides an example of how this could be done using onto_tool (though the example query replaces the non-tagged version rather than adding to it).

SPARQL tools apply a SPARQL Update query to each input file and serialize the resulting graph into the output file. RDF format is preserved unless overridden with the format option. If the query is specified inline, template substitution will be applied to it, so bundle variables can be used, but double braces ({{ instead of {, }} instead of }) have to be used to escape actual braces.
  - name: "add-language-en"
    type: "sparql"
    query: >
      prefix skos: <http://www.w3.org/2004/02/skos/core#>
      DELETE {{
        ?subject skos:prefLabel ?nolang .
      }}
      INSERT {{
        ?subject skos:prefLabel ?withlang
      }}
      where {{
        ?subject skos:prefLabel ?nolang .
        FILTER(lang(?nolang) = '')
        BIND(STRLANG(?nolang, '{lang}') as ?withlang)
      }}
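
The quoted query replaces the untagged label, whereas this issue asks for tagged copies to be added alongside the untagged ones. A minimal sketch of an additive variant follows. It assumes the inline template behaves like Python's `str.format`, which is consistent with the double-brace escaping described above but is an assumption about onto_tool's internals. `ADD_LANG_TEMPLATE` and `render_query` are hypothetical names used only for illustration.

```python
# Hypothetical additive variant of the quoted query: there is no DELETE
# clause, so the untagged label stays and a tagged copy is inserted beside it.
ADD_LANG_TEMPLATE = """\
prefix skos: <http://www.w3.org/2004/02/skos/core#>
INSERT {{
  ?subject skos:prefLabel ?withlang .
}}
WHERE {{
  ?subject skos:prefLabel ?nolang .
  FILTER(lang(?nolang) = '')
  BIND(STRLANG(?nolang, '{lang}') AS ?withlang)
}}
"""


def render_query(lang: str) -> str:
    """Fill in the language tag; '{{' and '}}' collapse to literal braces."""
    return ADD_LANG_TEMPLATE.format(lang=lang)


if __name__ == "__main__":
    print(render_query("en"))
```

Because an RDF graph is a set of triples, running such an update twice should not create duplicate tagged labels.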

@rjyounes
Collaborator Author

rjyounes commented May 27, 2023

Team discussion on 2023-05-25 led to the following conclusions:

* Related issues:
    * [Automated generation of en version](https://github.com/semanticarts/gist/issues/840)
    * [Sample French version](https://github.com/semanticarts/gist/issues/841)
    * [Sample Spanish version](https://github.com/semanticarts/gist/issues/851) (Pat McBennett to submit PR)
    * [Documentation](https://github.com/semanticarts/gist/issues/842)
* Open discussion points
    * Language or language+subtag? Subtag could be region or script. E.g., in Chinese the script is of primary significance in written text, not the region. From John Cowan.
        * Maybe language only if there are no significant regional distinctions, otherwise language+region? (Note, for accuracy: subtags are not always 2 letters.)
        * Don't need to prescribe, just let people submit what they want to? We could reject a region tag where there are no regional distinctions.
        * What do we want to do for our versions?
            * English - use language only, may change in future
            * French - use language only, may change in future
    * How do we review PRs in languages we have no expertise in?
        * Trust factor
        * In-house expertise
        * Test in Google Translate (to English); if they look OK, accept them. If not, reject the whole batch.
        * Add a disclaimer for those we haven't been able to review, and also that they may not be up-to-date as gist evolves.
        * SA in-house expertise:
            * French: Doug, Jess
            * German: Rebecca
            * Arabic: Dalia
            * Simplified and Traditional Chinese: Katie (some)
            * Japanese: Katie (some)
            * Russian: Irina, Boris
            * Hebrew: Boris (some)
            * Italian: Irina
    * No default untagged version - our default should be `en-US`. From John Cowan.
        * Then this is a major change.
        * Decision: keep untagged versions as defaults in ontology files. Can change later in a major release.
    * File names and directory structure.
        * Only release in Turtle.
        * Most likely only gistCore, not supplementary ontologies, will be translated.
        * There might be a translation of the readme if someone wants to do it.
        * So, we'll only have one ontology file per language. Naming conventions:
            * `gistCore.en.ttl`
            * `README.en.md`
        * So use one directory. [We didn't decide on a directory name, I propose `language_tagged_annotations` or similar.]
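
The naming convention above can be sketched as a small helper. This is only an illustration: the directory name is the undecided proposal from the notes, and `ANNOTATION_DIR` and `tagged_file_name` are hypothetical names. The check for a language-only tag reflects the "language only" choice recorded for the English and French versions.

```python
from pathlib import PurePosixPath

# Assumption: this directory name is only the proposal from the notes above.
ANNOTATION_DIR = PurePosixPath("language_tagged_annotations")


def tagged_file_name(base: str, lang: str, extension: str) -> PurePosixPath:
    """Build '<dir>/<base>.<lang>.<extension>', e.g. gistCore.en.ttl."""
    # Language-only tags have no hyphenated subtags, so they are purely alphabetic.
    if not lang or not lang.isalpha():
        raise ValueError(f"expected a language-only tag, got {lang!r}")
    return ANNOTATION_DIR / f"{base}.{lang.lower()}.{extension}"


if __name__ == "__main__":
    print(tagged_file_name("gistCore", "en", "ttl"))
    print(tagged_file_name("README", "en", "md"))
```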

@rjyounes
Collaborator Author

rjyounes commented Sep 28, 2023

Discussion of whether to put English-tagged versions in a separate file, with a default untagged version, or tag the current version with English tags. If we have a default untagged version, this is the one that would be edited.

The latter breaks backward compatibility.

Existing URL - untagged - if we have one. Default language-tagged version would be gistCore.en.ttl.

We decided previously to use only language, not language + region.

Non-English-tagged versions will go in separate files.

Boris:

  • Put all tagged versions in the core file, not separate files - separate files introduce a maintenance problem
  • Next minor release create two versions:
    • (1) only English labels without tags (for backward compatibility)
    • (2) full file, all languages. This will retain backward compatibility.
  • On major release, we flip it - the internationalized version becomes the default, and the default English-tagged only becomes the non-default and deprecated.

Rebecca: Keeping everything in core file means whenever there's a commit, you have to get a review from a language expert.

Boris: Have a completely separate ontology for each language, containing only annotations, with a dependency on a certain version of gist.

Rebecca: overhead too high, let's not do it. We don't have a large enough team to support internationalization. Some firms subcontract out internationalization.

Peter: Use content negotiation to get different language versions.
Minimally tag everything with @en, rather than having an untagged version. That gets our foot in the door of internationalization, and we can see how far we want to go in future.

Jamie: Keeping in separate files will decrease the burden.

Rebecca: Then we have a maintenance issue.

Rebecca: Internationalized versions have to be very precise to handle the level of precision we put into our annotations.

Boris: Have a gradual on-ramp. Difficult verbiage is in definitions. So maybe first step is to just do pref labels. Do first major version with labels only, and see how much of a burden it is to maintain them before moving forward.
If we are concerned about maintaining a separate file, have explicit version dependencies, merge during release process if the versions are compatible.
Use SHACL to require English tags, and no more than one prefLabel and definition per language.
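
The two rules Boris proposes (an English value is required, and at most one value per language) can be sketched in plain Python to show what the check would catch. This is not SHACL, only an illustration of the logic; in SHACL itself, `sh:uniqueLang` covers the one-value-per-language rule for tagged literals. `Annotation` and `check_annotations` are hypothetical names, and the sample data in the usage block is invented.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Annotation(NamedTuple):
    subject: str
    predicate: str  # e.g. "skos:prefLabel" or "skos:definition"
    text: str
    lang: str       # "" means untagged


def check_annotations(annotations: Iterable[Annotation]) -> List[str]:
    """Return human-readable violations; an empty list means all checks pass."""
    items = list(annotations)
    problems = []

    # Rule 1: at most one value per (subject, predicate, language).
    counts = Counter((a.subject, a.predicate, a.lang.lower()) for a in items)
    for (subject, predicate, lang), n in sorted(counts.items()):
        if n > 1:
            label = lang or "(none)"
            problems.append(f"{subject} has {n} {predicate} values for language {label}")

    # Rule 2: every annotated (subject, predicate) pair needs an English value.
    pairs = {(a.subject, a.predicate) for a in items}
    with_english = {(a.subject, a.predicate) for a in items if a.lang.lower() == "en"}
    for subject, predicate in sorted(pairs - with_english):
        problems.append(f"{subject} has no 'en' {predicate}")

    return problems


if __name__ == "__main__":
    sample = [
        Annotation("gist:Person", "skos:prefLabel", "Person", "en"),
        Annotation("gist:Person", "skos:prefLabel", "Personne", "fr"),
        Annotation("gist:Person", "skos:prefLabel", "Human", "en"),
    ]
    for problem in check_annotations(sample):
        print(problem)
```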

Rebecca: How many users do we really have that don't know good English? We should acknowledge that English is the lingua franca of the corporate world, finance, business, science, academia, IT, engineering...any domain we are likely to enter.

Peter: Have English tags, don't do others if the maintenance is too high.

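If you have multiple tags, you have to query as:
FILTER(langMatches(lang(?label), 'en')) - language-range match (covers `en` and regional variants such as `en-US`)
SELECT ?v WHERE { ?v ?p "cat"@en } - exact string match (only literals tagged exactly `en`)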
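
The difference between the two query styles comes down to how the language tag is compared. The sketch below models SPARQL's `langMatches` as basic language-range filtering (RFC 4647) next to an exact tag comparison. `lang_matches` and `exact_match` are hypothetical helper names, and lowercasing both sides in the exact comparison is a simplification, since case handling of tags can differ between triple stores.

```python
def lang_matches(tag: str, lang_range: str) -> bool:
    """Basic language-range filtering, the scheme SPARQL langMatches follows."""
    tag, lang_range = tag.lower(), lang_range.lower()
    if not tag:
        return False  # untagged literals never match, not even "*"
    if lang_range == "*":
        return True   # "*" matches any tagged literal
    # The range matches the same tag, or a longer tag that extends it with "-".
    return tag == lang_range or tag.startswith(lang_range + "-")


def exact_match(tag: str, wanted: str) -> bool:
    """Models a pattern such as "cat"@en: the tag itself must be the same."""
    # Simplification: both sides are lowercased before comparing.
    return tag.lower() == wanted.lower()


if __name__ == "__main__":
    for tag in ["en", "en-US", "fr", ""]:
        print(repr(tag), lang_matches(tag, "en"), exact_match(tag, "en"))
```

With untagged literals in the data, neither style finds them, which is one way adding tags can disrupt existing queries that match plain strings.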

@rjyounes
Collaborator Author

rjyounes commented Oct 12, 2023

Rebecca - Do nothing

Mark - Don't put language tags in the primary gist file.

Questions:

  1. Do we add any language tags?
  2. If yes, do we offer a non-tagged version?
  3. If we are adding tags, do they go in the main file or ancillary files?

DECISION:

  • Do nothing until we have a business case.
    • Disrupts SPARQL queries
    • Large maintenance issue
    • No current demand from gist users
  • If/when a need arises, we will create the infrastructure to do so


2 participants