[v3.8.2] Segmentation Fault when running lemmatisation (Windows) #13692

Open
jonathanfox5 opened this issue Nov 19, 2024 · 1 comment

jonathanfox5 commented Nov 19, 2024

Update

I've done some digging and this only seems to affect v3.8. Downgrading to v3.7 fixes the problem.

The only 3.8 version I've tried is 3.8.2 so I'm unsure if 3.8.0 / 3.8.1 are also affected.
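
When comparing a working 3.7 environment against a crashing 3.8 one, it may help to record the versions of spaCy and the compiled packages installed alongside it. The following is a minimal sketch using only the standard library. The helper name `installed_versions` is hypothetical, and checking `thinc` and `blis` in particular is an assumption based on the pip list in this report.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: version string or None} for each name."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


if __name__ == "__main__":
    # Assumption: these are the packages worth comparing between environments
    for name, ver in installed_versions(["spacy", "thinc", "blis"]).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this in both environments and diffing the output shows which packages changed along with spaCy.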

Overview

When calling a spacy.language.Language pipeline in a script on Windows, the process randomly crashes with a segmentation fault. In PowerShell the script simply stops executing with no message; you need to run it in bash to see the "segmentation fault" error. This error does NOT appear on macOS, even in identical environments.

There appears to be no link between the text input and the crashes since:

  • It crashes on a different sentence each time
  • It processes the same sentence perfectly fine on subsequent runs

I've tracked it down to the call to the spacy.language.Language object by placing logging statements on either side of the call. The error is not caught by a try/except block.
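
A segmentation fault kills the interpreter itself, so an in-process try/except never gets a chance to run. One way to observe the crash is to run the work in a child process and inspect its exit code. The sketch below assumes nothing about spaCy: `run_isolated` and `describe_exit` are hypothetical helper names, and the snippet passed in would be whatever code loads the model and processes a batch.

```python
import subprocess
import sys


def run_isolated(code: str) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    result = subprocess.run([sys.executable, "-c", code])
    return result.returncode


def describe_exit(returncode: int) -> str:
    """Describe an exit code; hard crashes show up as abnormal codes."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # POSIX: a negative value means the child was killed by that signal
        return f"killed by signal {-returncode}"
    # Windows reports NTSTATUS values, e.g. 0xC0000005 for an access violation
    return f"exit code 0x{returncode & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    # Placeholder snippet; replace with code that runs the pipeline
    print(describe_exit(run_isolated("print('hello from child')")))
```

Because the parent process survives, the same batch can be retried or the failing run logged, which also makes it possible to count how often the crash occurs.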

Three examples of sentences that it has crashed on:

le do la camera del sole al primo piano.
marco is offering you a drink.
non c'è niente qua giù.

The model being used is it-core-news-lg==3.8.0.
Update: The crash occurs on the English large model too.

Any advice is appreciated!

How to reproduce the behaviour

Simplified lemmatizer class:

import spacy

class ClassName:
    _nlp: spacy.language.Language

    def __init__(self, model_name: str):
        self._nlp = spacy.load(name=model_name)

    def lemmatize(self, input_str: str) -> list[str]:
        # Random crashes on this line
        # Try / except doesn't make any difference
        doc = self._nlp(text=input_str)

        # Do stuff (placeholder: the real class does more; here we only collect lemmas)
        return [token.lemma_ for token in doc]

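Simplified application code:

# spacy.load() expects the importable package name (underscores), not the pip name
lemmatizer = ClassName(model_name="it_core_news_lg")
for sentence in big_sentence_list:  # list of subtitle sentences (str)
    x = lemmatizer.lemmatize(sentence)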

Actual class being used is here
Actual application code is here, within the generate_frequency_analysis function.

The code crashes on ~20% of the runs, even with identical input data. Each run has subtitles from ~100 minutes worth of mixed Italian / English content.

Your Environment

  • spaCy version: 3.8.2
  • Platform: Windows-11-10.0.26100-SP0
  • Python version: 3.12.7
  • Model: it-core-news-lg==3.8.0
  • System: Running on CPU: Ryzen 5600X, 32 GB RAM, Running on GPU: RTX 3070 Ti
  • pip list (from application code):
aiohappyeyeballs              2.4.3
aiohttp                       3.11.6
aiosignal                     1.3.1
alembic                       1.14.0
annotated-types               0.7.0
antlr4-python3-runtime        4.9.3
argos-spacy-compatibility     0.1.0
asteroid-filterbanks          0.4.0
attrs                         24.2.0
audioread                     3.0.1
av                            12.3.0
blis                          1.0.1
cached-property               2.0.1
catalogue                     2.0.10
certifi                       2024.8.30
cffi                          1.17.1
charset-normalizer            3.4.0
chevron                       0.14.0
click                         8.1.7
cloudpathlib                  0.20.0
colorama                      0.4.6
coloredlogs                   15.0.1
colorlog                      6.9.0
confection                    0.1.5
contourpy                     1.3.1
ctranslate2                   4.5.0
cycler                        0.12.1
cymem                         2.0.8
decorator                     5.1.1
docopt                        0.6.2
einops                        0.8.0
et-xmlfile                    2.0.0
faster-whisper                1.0.3
ffmpeg-python                 0.2.0
filelock                      3.16.1
flatbuffers                   24.3.25
fonttools                     4.55.0
frozendict                    2.4.6
frozenlist                    1.5.0
fsspec                        2024.10.0
future                        1.0.0
genanki                       0.13.1
gogadget                      0.2.2
greenlet                      3.1.1
huggingface-hub               0.26.2
humanfriendly                 10.0
hyperpyyaml                   1.2.2
idna                          3.10
jinja2                        3.1.4
joblib                        1.4.2
julius                        0.2.7
kiwisolver                    1.4.7
langcodes                     3.5.0
language-data                 1.3.0
lazy-loader                   0.4
lemon-tizer                   0.0.5
librosa                       0.10.2.post1
lightning                     2.4.0
lightning-utilities           0.11.9
llvmlite                      0.43.0
mako                          1.3.6
marisa-trie                   1.2.1
markdown-it-py                3.0.0
markupsafe                    3.0.2
matplotlib                    3.9.2
mdurl                         0.1.2
mpmath                        1.3.0
msgpack                       1.1.0
multidict                     6.1.0
murmurhash                    1.0.10
networkx                      3.4.2
nltk                          3.9.1
numba                         0.60.0
numpy                         2.0.2
omegaconf                     2.3.0
onnxruntime                   1.20.0
openpyxl                      3.1.5
optuna                        4.1.0
packaging                     24.2
pandas                        2.2.3
pillow                        11.0.0
pip                           24.3.1
platformdirs                  4.3.6
pooch                         1.8.2
preshed                       3.0.9
primepy                       1.3
propcache                     0.2.0
protobuf                      5.28.3
pyannote-audio                3.3.2
pyannote-core                 5.0.0
pyannote-database             5.1.0
pyannote-metrics              3.2.1
pyannote-pipeline             3.0.1
pycparser                     2.22
pydantic                      2.9.2
pydantic-core                 2.23.4
pygments                      2.18.0
pyparsing                     3.2.0
pyreadline3                   3.5.4
pysubs2                       1.7.3
python-dateutil               2.9.0.post0
pytorch-lightning             2.4.0
pytorch-metric-learning       2.7.0
pytz                          2024.2
pyyaml                        6.0.2
regex                         2024.11.6
requests                      2.32.3
rich                          13.9.4
rtoml                         0.11.0
ruamel-yaml                   0.18.6
ruamel-yaml-clib              0.2.12
sacremoses                    0.0.53
safetensors                   0.4.5
scikit-learn                  1.5.2
scipy                         1.14.1
semver                        3.0.2
sentencepiece                 0.2.0
setuptools                    75.5.0
shellingham                   1.5.4
six                           1.16.0
smart-open                    7.0.5
sortedcontainers              2.4.0
soundfile                     0.12.1
soxr                          0.5.0.post1
spacy                         3.8.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
speechbrain                   1.0.2
sqlalchemy                    2.0.36
srsly                         2.4.8
sympy                         1.13.1
tabulate                      0.9.0
tensorboardx                  2.6.2.2
thinc                         8.3.2
threadpoolctl                 3.5.0
tokenizers                    0.20.3
tomlkit                       0.13.2
torch                         2.5.1
torch-audiomentations         0.11.1
torch-pitch-shift             1.2.5
torchaudio                    2.5.1
torchmetrics                  1.6.0
tqdm                          4.67.0
transformers                  4.46.3
typer                         0.13.1
typing-extensions             4.12.2
tzdata                        2024.2
urllib3                       2.2.3
wasabi                        1.1.3
weasel                        0.4.1
whisperx-numpy2-compatibility 0.1.0
wrapt                         1.16.0
yarl                          1.17.2
yt-dlp                        2024.11.18
@jonathanfox5 jonathanfox5 changed the title Segmentation Fault Windows (Processing on CPU) Segmentation Fault when running spacy.language.Language (Windows, Processing on CPU) Nov 20, 2024
@jonathanfox5 jonathanfox5 changed the title Segmentation Fault when running spacy.language.Language (Windows, Processing on CPU) Segmentation Fault when running lemmatisation (Windows, Processing on CPU) Nov 20, 2024
@jonathanfox5 jonathanfox5 changed the title Segmentation Fault when running lemmatisation (Windows, Processing on CPU) [v3.8.2] Segmentation Fault when running lemmatisation (Windows) Nov 20, 2024

atlaste commented Nov 27, 2024

I'm having the same issue with the Dutch models. I've now tried on 3 different machines to rule out a problem with one particular installation.

The event viewer shows:

Faulting application name: python.exe, version: 3.12.150.1013, time stamp: 0x651ac086
Faulting module name: cy.cp312-win_amd64.pyd, version: 0.0.0.0, time stamp: 0x66e370c9
Exception code: 0xc0000005
Fault offset: 0x00000000000964ca
Faulting process id: 0x0x12DCC
Faulting application start time: 0x0x1DB40B111D4A7C6
Faulting application path: C:\Python312\python.exe
Faulting module path: C:\Python312\Lib\site-packages\blis\cy.cp312-win_amd64.pyd
Report Id: 2ae974e4-03be-4018-b6c8-c278468d4b7c
Faulting package full name:
Faulting package-relative application ID:

I'm not sure how to get a proper stack trace. The fault offset is the same every time.
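
One option for getting at least a Python-level traceback is the standard library's `faulthandler` module, which dumps the Python stack of each thread when the process hits a fatal error. This is a minimal sketch: the helper name `enable_crash_traceback` and the log file name are hypothetical, and the output shows only Python frames, not the native frames inside the faulting module.

```python
import faulthandler


def enable_crash_traceback(path="crash_traceback.log"):
    """Write Python tracebacks for all threads to `path` on a fatal error."""
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this handle open for as long as the handler is active
    return log_file


if __name__ == "__main__":
    crash_log = enable_crash_traceback()
    # ... load the model and run the pipeline as usual ...
```

The same handler can be turned on without code changes by starting the interpreter with `python -X faulthandler script.py`, in which case the traceback goes to stderr.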
