[Bug]: Problematic character removal does not correct for position in source text. #3566

Open
chriskamphuis opened this issue Nov 21, 2024 · 0 comments
chriskamphuis commented Nov 21, 2024

Describe the bug

Some problematic characters are removed before a Sentence object is created:

def __remove_zero_width_characters(text: str) -> str:

When this happens, the character offsets of tokens and spans no longer correspond to positions in the source text. Perhaps these characters should only be removed after the tokens are created, so that positions in the source text are preserved.
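One way to keep positions aligned is to record, for every character that survives the cleanup, its index in the original string. The following is only a minimal sketch of that idea, not Flair's implementation: `ZERO_WIDTH`, `strip_with_offsets`, and `to_source_span` are hypothetical names, and the set of stripped characters is an assumption (only `\u200c` is confirmed by the example in this issue).

```python
# Assumed set of characters to strip; only "\u200c" is confirmed by this issue.
ZERO_WIDTH = {"\u200b", "\u200c", "\ufeff"}


def strip_with_offsets(text):
    """Return (cleaned, offsets), where offsets[i] is the index in `text` of cleaned[i]."""
    kept = [(i, ch) for i, ch in enumerate(text) if ch not in ZERO_WIDTH]
    cleaned = "".join(ch for _, ch in kept)
    offsets = [i for i, _ in kept]
    return cleaned, offsets


def to_source_span(offsets, start, end):
    """Map a non-empty half-open span [start, end) in the cleaned text back to the source text."""
    if start >= end:
        raise ValueError("span must be non-empty")
    # Use the last character inside the span so the end does not swallow
    # characters that were removed right after the span.
    return offsets[start], offsets[end - 1] + 1


text = "Hello my name is \u200c Chris Kamphuis, and I live in \u200c the Netherlands."
cleaned, offsets = strip_with_offsets(text)

# Stand-in for a span position reported relative to the cleaned text.
start = cleaned.index("Chris Kamphuis")
end = start + len("Chris Kamphuis")

src_start, src_end = to_source_span(offsets, start, end)
print(text[src_start:src_end])  # prints "Chris Kamphuis"
```

With such a mapping, spans computed on the cleaned text can still be reported in source-text coordinates, so the removal would not have to move to after tokenization.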

To Reproduce

from flair.models import SequenceTagger
from flair.data import Sentence

tagger = SequenceTagger.load("flair/ner-english-large") 

text = "Hello my name is \u200c Chris Kamphuis, and I live in \u200c the Netherlands."
sentence = Sentence(text)
tagger.predict(sentence)
for span in sentence.get_spans():
    print(text[span.start_position:span.end_position])

produces:

 Chris Kamphui
e Netherlan
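The shift can be reproduced without loading a model. The sketch below assumes the tagger's spans are "Chris Kamphuis" and "Netherlands" (an inference from the lengths of the printed strings, not something the tagger output above states directly) and that span positions are computed on the text after `\u200c` has been removed. Each removed character before a span moves the slice one position to the left in the original text.

```python
text = "Hello my name is \u200c Chris Kamphuis, and I live in \u200c the Netherlands."

# Mimic the cleanup step: positions below are relative to this cleaned string.
cleaned = text.replace("\u200c", "")

shifted = []
for entity in ("Chris Kamphuis", "Netherlands"):  # assumed span texts
    start = cleaned.index(entity)
    end = start + len(entity)
    # Slicing the ORIGINAL text with cleaned-text positions gives the shifted result.
    shifted.append(text[start:end])
    print(repr(text[start:end]))

# prints ' Chris Kamphui' (one removed character precedes this span)
# prints 'e Netherlan'    (two removed characters precede this span)
```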

Expected behavior

It should produce:

Chris Kamphuis
Netherlands

### Logs and Stack traces

_No response_

### Screenshots

_No response_

### Additional Context

_No response_

### Environment

#### Versions:
##### Flair
0.14.0
##### Pytorch
2.5.1+cu124
##### Transformers
4.44.2
#### GPU
False
chriskamphuis added the bug label (Something isn't working) Nov 21, 2024
chriskamphuis changed the title from "[Bug]:" to "[Bug]: Prolematic character removal does not correct for position in source text." Nov 21, 2024
chriskamphuis changed the title from "[Bug]: Prolematic character removal does not correct for position in source text." to "[Bug]: Problematic character removal does not correct for position in source text." Nov 21, 2024
helpmefindaname self-assigned this Nov 22, 2024