Modelling of POS and morphological features #871

gcelano · 2024-11-13T12:18:48Z

gcelano
Nov 13, 2024

In my corpus, POS + morphological features are encoded as a 9-character string, with the first character being the POS and the others the morphological features. If I encode this string as it is in PAULA and then write an ANNIS query such as postag=/v.*/, the query takes a very long time to process. I guess that the best way to encode it is to separate the POS character from the others. Should one also consider splitting all the other values using multiFeat? How do you usually model morphological features in PAULA?

thomaskrause · 2024-11-15T11:46:14Z

thomaskrause
Nov 15, 2024
Maintainer

I think it is indeed best to split this up as much as possible. PAULA is agnostic about how morphological features are modelled, but, as you've seen, further processing depends on the complexity of the annotation value. In general, regular expressions with a literal prefix work better than ones with a .* at the beginning, because ANNIS can narrow the search based on the prefix. Then only the values that share the prefix have to be matched against the regular expression. Also, if a tagset has fewer possible values, the regular expression can be evaluated more easily and less storage is needed in the main-memory implementation. If you use a combined morphological value that represents several dimensions, the number of combinations is larger, and more distinct values have to be stored and searched.
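To illustrate the two points above, here is a minimal Python sketch. It is not how ANNIS is implemented internally; it only assumes that distinct annotation values are kept in sorted order and that a regular expression must match the whole value. The helper names (`split_tag`, `literal_prefix`, `search`) and the sample tags are hypothetical.

```python
import re
from bisect import bisect_left

# Regex metacharacters that end the literal prefix (simplified handling).
META = set(".^$*+?{}[]\\|()")


def split_tag(tag):
    """Split a 9-character tag into the POS character and 8 feature characters."""
    if len(tag) != 9:
        raise ValueError("expected a 9-character tag, got %r" % tag)
    return tag[0], list(tag[1:])


def literal_prefix(pattern):
    """Return the leading literal characters of a regex.

    Conservative: returns "" whenever the pattern contains an alternation,
    which is always safe because it just means a full scan.
    """
    if "|" in pattern:
        return ""
    prefix = []
    for ch in pattern:
        if ch in META:
            # A quantifier such as * or ? makes the previous literal optional.
            if ch in "*?{" and prefix:
                prefix.pop()
            break
        prefix.append(ch)
    return "".join(prefix)


def search(sorted_values, pattern):
    """Return (matches, number of values tested against the regex)."""
    regex = re.compile(pattern)
    prefix = literal_prefix(pattern)
    # Values sharing a prefix are contiguous in a sorted list.
    start = bisect_left(sorted_values, prefix)
    matches, checked = [], 0
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break
        checked += 1
        if regex.fullmatch(value):  # assumption: whole-value matching
            matches.append(value)
    return matches, checked


# Hypothetical sample tags, one POS character plus eight feature characters.
tags = ["v3spia---", "v3siia---", "n-s---mn-", "n-p---fa-", "a-s---mnc", "d--------"]
values = sorted(set(tags))

with_prefix = search(values, "v.*")      # only values starting with "v" are tested
no_prefix = search(values, ".*a---")     # every distinct value is tested

# After splitting, the POS annotation has far fewer distinct values.
pos_values = sorted({split_tag(t)[0] for t in tags})
```

In this toy data, the query with a prefix tests two values while the query starting with `.*` tests all six. With a real corpus, the gap grows with the number of distinct combined tags, which is why a separate POS annotation with a handful of values is cheaper to search.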
