Corpora with tokens that share the same left or right text boundary can't be imported #309

Closed
thomaskrause opened this issue Apr 9, 2014 · 0 comments

@thomaskrause

If node.tab contains tokens that share the same left or right text boundary, an SQL error is thrown because the subquery returns more than one row.

The error is in the file left_token_right_token.sql:

UPDATE _node AS parent SET 
left_token = (
  SELECT token_index FROM _node AS child 
  WHERE 
    parent.left = child.left 
    AND parent.corpus_ref = child.corpus_ref 
    AND parent.text_ref = child.text_ref 
    AND child.token_index IS NOT NULL
), 
right_token = (
  SELECT token_index FROM _node AS child 
  WHERE 
    parent.right = child.right 
    AND parent.corpus_ref = child.corpus_ref 
    AND parent.text_ref = child.text_ref 
    AND child.token_index IS NOT NULL
);

The subqueries should be changed to return the minimum or maximum token_index, so that each one always yields a single value.
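
A minimal sketch of what such a change could look like, not necessarily the fix that was committed. It assumes that MIN is the right choice for left_token and MAX for right_token; the issue only says "minimum or maximum", so that pairing is a guess. An aggregate without GROUP BY always returns exactly one row (NULL when nothing matches), which avoids the "more than one row" error.

```sql
-- Sketch only: the MIN/MAX pairing below is an assumption, not confirmed by the issue.
UPDATE _node AS parent SET
left_token = (
  -- If several tokens start at the same left boundary, take the smallest index.
  SELECT MIN(child.token_index) FROM _node AS child
  WHERE
    parent.left = child.left
    AND parent.corpus_ref = child.corpus_ref
    AND parent.text_ref = child.text_ref
    AND child.token_index IS NOT NULL
),
right_token = (
  -- If several tokens end at the same right boundary, take the largest index.
  SELECT MAX(child.token_index) FROM _node AS child
  WHERE
    parent.right = child.right
    AND parent.corpus_ref = child.corpus_ref
    AND parent.text_ref = child.text_ref
    AND child.token_index IS NOT NULL
);
```

With this pairing a node covers the widest token span consistent with its boundaries; swapping the aggregates would give the narrowest one instead.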

According to the ANNIS data model, such data is not really valid. Unfortunately, some corpora contain this kind of error, which went undetected because older versions of ANNIS ignored the problem. We can't fix all corpora, so we should make the import more robust against this error.

thomaskrause added a commit that referenced this issue Apr 9, 2014
thomaskrause added this to the 3.1.2 milestone Apr 9, 2014
thomaskrause self-assigned this Apr 9, 2014