Japanese: discussions regarding indexing of specific expressions #167

Open
makorin0315 opened this issue Sep 9, 2021 · 2 comments

makorin0315 commented Sep 9, 2021

In discussions with Dr. Torikai and Dr. Noguchi (@Rei-hub), we've identified a few expressions whose indexing results could be improved. For some of the topics below, we have not yet reached a conclusion on what the correct implementation should be to achieve the desired outcome. I will update this issue on each topic as 1) experiments yield new observations, and 2) an agreed approach is implemented.

a. とか vs. など in Negation expressions

  • とか is a colloquial, spoken-style expression, but it often appears in informal text such as doctors' progress notes (as opposed to the written expression など, which is more formal)
  • it would be preferable for the two to be indexed the same way, especially in expressions such as 胸痛とかはない. In this example, とか would be NonRelevant and は would be PathRelevant.
  • after some investigation, we determined that this can be achieved for instances that include とかは and とかが
  • in both cases, we need to extend the Negation span to the Concept before とか
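
The rule in item a can be sketched roughly as follows. This is only an illustration in Python, not the engine's actual implementation: the `Token` class, the `negated` flag, and the starting labels are hypothetical, and the segmentation of 胸痛とかはない is written by hand.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    label: str          # e.g. "Concept", "Relation", "NonRelevant", "PathRelevant", "Negation"
    negated: bool = False  # hypothetical flag: True if the token is inside a Negation span


FILLERS = {"とか", "など"}    # treated the same way, per the discussion above
PARTICLES = {"は", "が"}      # only とかは / とかが (and the など equivalents) are covered
NEGATORS = {"ない", "無い"}


def extend_negation(tokens):
    """Relabel Concept + とか/など + は/が + ない and extend the Negation span to the Concept."""
    for i in range(len(tokens) - 3):
        concept, filler, particle, negator = tokens[i:i + 4]
        if (concept.label == "Concept"
                and filler.text in FILLERS
                and particle.text in PARTICLES
                and negator.text in NEGATORS):
            filler.label = "NonRelevant"
            particle.label = "PathRelevant"
            negator.label = "Negation"
            concept.negated = True
    return tokens


# Hand-segmented example; the initial "Relation" labels are placeholders.
example = extend_negation([
    Token("胸痛", "Concept"),
    Token("とか", "Relation"),
    Token("は", "Relation"),
    Token("ない", "Relation"),
])
```

A plain 胸痛はない (no とか or など) is left untouched by this sketch, since the existing Negation handling is assumed to cover it already.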

b. などの and 等の

  • in expressions such as 胸痛などの症状 and 胸痛などの重要所見, it would be preferable to index the entire expression as one long Concept
  • this is doable, but it is not appropriate if the particle や precedes the Concept before などの. In such cases, などの should still be a Relation
  • the same implementation should apply to 等の
  • implementing this would likely be a relatively large effort.
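
As a rough sketch of the merge rule in item b (again hypothetical Python, not the actual implementation; the `(text, label)` tuples and the hand-written segmentation are assumptions for illustration only):

```python
MERGE_MARKERS = {"などの", "等の"}
PARALLEL_PARTICLE = "や"


def merge_nado_no(segments):
    """Merge Concept + などの/等の + Concept into one long Concept.

    `segments` is a list of (text, label) tuples in sentence order.
    The merge is skipped when や immediately precedes the first Concept,
    so that などの/等の stays a Relation in parallel expressions.
    """
    out = []
    i = 0
    while i < len(segments):
        window = segments[i:i + 3]
        can_merge = (
            len(window) == 3
            and window[0][1] == "Concept"
            and window[1][0] in MERGE_MARKERS
            and window[2][1] == "Concept"
            and not (out and out[-1][0] == PARALLEL_PARTICLE)
        )
        if can_merge:
            out.append(("".join(text for text, _ in window), "Concept"))
            i += 3
        else:
            out.append(segments[i])
            i += 1
    return out


merged = merge_nado_no([("胸痛", "Concept"), ("などの", "Relation"), ("症状", "Concept")])
kept = merge_nado_no([
    ("息切れ", "Concept"), ("や", "PathRelevant"),
    ("胸痛", "Concept"), ("などの", "Relation"), ("症状", "Concept"),
])
```

Here `merged` holds the single long Concept 胸痛などの症状, while `kept` is returned unchanged because や precedes 胸痛.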

c. Sahen+SuruVerb+ことはない vs. Sahen+SuruVerb+ことは無い

  • currently, 失神したことはない and 失神したことは無い are indexed differently. They should be identical. The current indexing result for the phrase 失神したことはない is considered correct: Relation+Negation+Sahen
  • we've also discussed making 失神したこと a Concept, but when the Sahen verb has an object, it looks better as a Relation. Examples: 企業と個人の取引に適用したことはない。これまで天皇の即位前に新元号を公表したことはない。
  • one approach may be to make Sahen+SuruVerb+こと a Concept if the Sahen is not preceded by an object-marking particle such as を or に. This approach needs further experimentation.
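
The candidate rule in the last bullet could look something like the sketch below. It is untested against real data; the `(surface, pos)` tuples, the POS tag names other than Sahen and SuruVerb, and the function name are all hypothetical, and the segmentations are written by hand.

```python
OBJECT_PARTICLES = {"を", "に"}


def label_koto_span(morphemes):
    """Return "Concept" or "Relation" for the first Sahen+SuruVerb+こと span.

    `morphemes` is a list of (surface, pos) tuples. Returns None when the
    pattern does not occur in the input.
    """
    for i in range(len(morphemes) - 2):
        (_, pos0), (_, pos1), (surface2, _) = morphemes[i:i + 3]
        if pos0 == "Sahen" and pos1 == "SuruVerb" and surface2 == "こと":
            # An object-marking particle right before the Sahen keeps it a Relation.
            has_object = i > 0 and morphemes[i - 1][0] in OBJECT_PARTICLES
            return "Relation" if has_object else "Concept"
    return None


no_object = label_koto_span([
    ("失神", "Sahen"), ("した", "SuruVerb"), ("こと", "Noun"),
    ("は", "Particle"), ("ない", "Adjective"),
])
with_object = label_koto_span([
    ("新元号", "Noun"), ("を", "Particle"),
    ("公表", "Sahen"), ("した", "SuruVerb"), ("こと", "Noun"),
    ("は", "Particle"), ("ない", "Adjective"),
])
```

With these inputs, `no_object` is "Concept" and `with_object` is "Relation". The sketch only checks the morpheme directly before the Sahen, so an object separated from the verb by other words would not be detected.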

d. Concept1+や+Concept2+の+Concept3

(new on October 5, 2021)

  • currently, 息切れや胸痛の症状 is indexed as 3 separate Concepts, with や being PathRelevant and の being a Relation, whereas 胸痛の症状 is a single Concept. This is done deliberately because 息切れ is used in parallel with 胸痛. @Rei-hub and I discussed exploring the possibility of making both や and の PathRelevant in such cases so that this parallelism can be identified. Thorough experimentation is required, but this may make it possible to keep track of a group of entities within a sentence.
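
To illustrate what "keeping track of a group of entities" could mean for item d, here is a minimal Python sketch. It is an assumption about one possible representation, not something the engine does today; the function name and the returned dictionary are hypothetical.

```python
def find_parallel_group(texts):
    """Find Concept1 (や Concept2)+ の Concept3 in a list of segment surfaces.

    Returns {"members": [...], "head": ...} for the first match, or None.
    """
    for start in range(len(texts)):
        if texts[start] in ("や", "の"):
            continue  # a group cannot start on a particle
        members = [texts[start]]
        i = start + 1
        # Collect every Concept joined to the previous one by や.
        while i + 1 < len(texts) and texts[i] == "や":
            members.append(texts[i + 1])
            i += 2
        # The group is complete only if の and a head Concept follow.
        if len(members) > 1 and i + 1 < len(texts) and texts[i] == "の":
            return {"members": members, "head": texts[i + 1]}
    return None


group = find_parallel_group(["息切れ", "や", "胸痛", "の", "症状"])
single = find_parallel_group(["胸痛", "の", "症状"])
```

For 息切れや胸痛の症状 this yields the members 息切れ and 胸痛 with the head 症状, while 胸痛の症状 yields no group and can remain a single Concept.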
@makorin0315 makorin0315 self-assigned this Sep 9, 2021

Rei-hub commented Sep 16, 2021

Thank you for summarizing and organizing the main points.

  • About the first point, Negation expressions should work in the same way regardless of whether there is "とか"/"など" or not. As you described, it would be preferable that "とか"/"など" is NonRelevant and "は" is PathRelevant. Extending the Negation span to the Concept may be possible with simple logic based on a partial match of "とか"/"など".

  • About the second point, the Concept identification should work in a consistent manner regardless of whether there is "などの"/"等の" or not. Based on this consistent rule, if there is a parallel expression combined by a particle such as "や", as you pointed out, it should be split before "などの" as a Concept.
    e.g.)
    息切れや胸痛の症状 → 息切れ + 胸痛 + 症状
    息切れや胸痛などの症状 → 息切れ + 胸痛 + 症状
    (The Concept should be identified in the same way with or without "などの".)

  • Regarding the final point, "失神する" and "失神したことがある" could have a slight difference in nuance. The former means current symptom onset, whereas the latter carries the nuance of previous medical history. It is unclear at this point how much of a difference distinguishing the two expressions will make in my research, but at the very least, the Concept should be identified in the same way regardless of whether "ない" (Hiragana) or "無い" (Kanji) is used.
    In terms of the Negation span, your suggestion of making "Sahen+SuruVerb+こと" a Concept seems to be a realistic solution that would work well for my research.
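
One simple way to guarantee the ない/無い consistency mentioned above would be to normalize the orthographic variant before indexing. The sketch below is only an assumption about how that might be done (the mapping table and function name are hypothetical), and it covers just the dictionary form, not inflected forms.

```python
# Hypothetical variant table: Kanji spelling mapped to the Hiragana spelling.
ORTHOGRAPHIC_VARIANTS = {"無い": "ない"}


def normalize_variants(tokens):
    """Replace known orthographic variants so both spellings index identically."""
    return [ORTHOGRAPHIC_VARIANTS.get(token, token) for token in tokens]


hiragana = normalize_variants(["失神", "した", "こと", "は", "ない"])
kanji = normalize_variants(["失神", "した", "こと", "は", "無い"])
```

After normalization, `hiragana` and `kanji` are the same token list, so any later rule sees identical input for both spellings.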

In any case, you will need many experiments and validations in real medical texts, and I am happy to help and support you.

@makorin0315
Collaborator Author

For ease of reference, I've edited the original post so that each item is now part of the ordered list "a, b, c, etc."
