[v3.8.2] Segmentation Fault when running lemmatisation (Windows) #13692

jonathanfox5 · 2024-11-19T23:59:47Z

Update

I've done some digging and this only seems to affect v3.8. Downgrading to v3.7 fixes the problem.

The only 3.8 version I've tried is 3.8.2 so I'm unsure if 3.8.0 / 3.8.1 are also affected.
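Since the behaviour differs between releases, it may help to record the exact versions of spaCy and its native dependencies in each environment being compared. The snippet below is a minimal sketch using the standard library; the helper name `report_versions` and the default package list are assumptions for illustration (blis is included because the Event Viewer log later in this thread names it as the faulting module).

```python
from importlib import metadata


def report_versions(packages=("spacy", "thinc", "blis", "numpy")):
    """Return a mapping of package name to installed version (None if missing).

    Hypothetical helper: the default package list is an assumption, not an
    official list of components involved in the crash.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the working (3.7) and crashing (3.8) environments gives a short diff of the packages that changed.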

Overview

When calling a spacy.language.Language pipeline in a script on Windows, the process randomly dies with a segmentation fault. In PowerShell the script simply stops executing; you need to run it in bash to see the "segmentation fault" message. This error does NOT appear on macOS, even in identical environments.

There appears to be no link between the text input and the crashes since:

It crashes on a different sentence each time
It processes the same sentence perfectly fine on subsequent runs

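I've tracked it down to the spacy.language.Language call by placing logging statements on either side of it. The error is not caught by a try/except block, which is expected for a segmentation fault: the crash happens in native code and terminates the whole interpreter rather than raising a Python exception.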
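Because the crash cannot be caught inside the process, one way to observe it is to run the script as a child process and inspect the exit code. This is a sketch under stated assumptions: the helper names `describe_exit` and `run_isolated` are hypothetical, and it assumes that Python reports a Windows access violation as the unsigned exit code 0xC0000005 (the exception code shown in the Event Viewer log in this thread) and a POSIX signal as a negative return code.

```python
import signal
import subprocess
import sys

# Windows exception code for an invalid memory access (assumed to surface
# as the child's exit code when the process dies from it).
ACCESS_VIOLATION = 0xC0000005


def describe_exit(returncode: int) -> str:
    """Turn a child process return code into a short human-readable label."""
    if returncode == 0:
        return "ok"
    if returncode == ACCESS_VIOLATION:
        return "access violation (0xC0000005)"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        try:
            return f"killed by {signal.Signals(-returncode).name}"
        except ValueError:
            return f"killed by signal {-returncode}"
    return f"exited with code {returncode}"


def run_isolated(args: list[str]) -> str:
    """Run the current Python interpreter with `args` and describe how it ended."""
    completed = subprocess.run([sys.executable, *args])
    return describe_exit(completed.returncode)
```

For example, `run_isolated(["my_script.py"])` (where `my_script.py` is a placeholder for the script that calls the pipeline) lets the parent process log the crash and carry on.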

Three examples of sentences that it has crashed on:

le do la camera del sole al primo piano.
marco is offering you a drink.
non c'è niente qua giù.

The model being used is it-core-news-lg==3.8.0.
Update: The crash occurs on the English large model too.

Any advice is appreciated!

How to reproduce the behaviour

Simplified lemmatizer class:

import spacy

class ClassName:
    _nlp: spacy.language.Language

    def __init__(self, model_name: str):
        # Load the pipeline once and reuse it for every sentence
        self._nlp = spacy.load(name=model_name)

    def lemmatize(self, input_str: str) -> list[str]:
        # Random crashes on this line
        # Try / except doesn't make any difference
        doc = self._nlp(input_str)

        # Placeholder for the real post-processing: return each token's lemma
        return [token.lemma_ for token in doc]

Simplified application code

lemmatizer = ClassName(model_name="it_core_news_lg")
for sentence in big_sentence_list:
    x = lemmatizer.lemmatize(sentence)

Actual class being used is here
Actual application code is here, within the generate_frequency_analysis function.

The code crashes on ~20% of runs, even with identical input data. Each run processes subtitles from ~100 minutes' worth of mixed Italian / English content.
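Since the failure is intermittent, a crash rate is easier to compare across spaCy versions than a single run. The following is a minimal sketch that repeats a command and counts non-zero exits; `count_failures` is a hypothetical helper, and treating every non-zero exit as a crash is a simplifying assumption.

```python
import subprocess
import sys


def count_failures(command: list[str], runs: int) -> int:
    """Run `command` `runs` times and count how often it exits non-zero.

    Assumption: any non-zero exit code is treated as a failure, whether it
    comes from a native crash or from an ordinary Python error.
    """
    failures = 0
    for _ in range(runs):
        if subprocess.run(command).returncode != 0:
            failures += 1
    return failures
```

A call such as `count_failures([sys.executable, "repro.py"], runs=20)`, where `repro.py` is a placeholder for the reproduction script, can then be repeated under 3.7 and 3.8 with the same input data.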

Your Environment

spaCy version: 3.8.2
Platform: Windows-11-10.0.26100-SP0
Python version: 3.12.7
Model: it-core-news-lg==3.8.0
System: Running on CPU: Ryzen 5600X, 32 GB RAM, Running on GPU: RTX 3070 Ti
pip list (from application code):

aiohappyeyeballs              2.4.3
aiohttp                       3.11.6
aiosignal                     1.3.1
alembic                       1.14.0
annotated-types               0.7.0
antlr4-python3-runtime        4.9.3
argos-spacy-compatibility     0.1.0
asteroid-filterbanks          0.4.0
attrs                         24.2.0
audioread                     3.0.1
av                            12.3.0
blis                          1.0.1
cached-property               2.0.1
catalogue                     2.0.10
certifi                       2024.8.30
cffi                          1.17.1
charset-normalizer            3.4.0
chevron                       0.14.0
click                         8.1.7
cloudpathlib                  0.20.0
colorama                      0.4.6
coloredlogs                   15.0.1
colorlog                      6.9.0
confection                    0.1.5
contourpy                     1.3.1
ctranslate2                   4.5.0
cycler                        0.12.1
cymem                         2.0.8
decorator                     5.1.1
docopt                        0.6.2
einops                        0.8.0
et-xmlfile                    2.0.0
faster-whisper                1.0.3
ffmpeg-python                 0.2.0
filelock                      3.16.1
flatbuffers                   24.3.25
fonttools                     4.55.0
frozendict                    2.4.6
frozenlist                    1.5.0
fsspec                        2024.10.0
future                        1.0.0
genanki                       0.13.1
gogadget                      0.2.2
greenlet                      3.1.1
huggingface-hub               0.26.2
humanfriendly                 10.0
hyperpyyaml                   1.2.2
idna                          3.10
jinja2                        3.1.4
joblib                        1.4.2
julius                        0.2.7
kiwisolver                    1.4.7
langcodes                     3.5.0
language-data                 1.3.0
lazy-loader                   0.4
lemon-tizer                   0.0.5
librosa                       0.10.2.post1
lightning                     2.4.0
lightning-utilities           0.11.9
llvmlite                      0.43.0
mako                          1.3.6
marisa-trie                   1.2.1
markdown-it-py                3.0.0
markupsafe                    3.0.2
matplotlib                    3.9.2
mdurl                         0.1.2
mpmath                        1.3.0
msgpack                       1.1.0
multidict                     6.1.0
murmurhash                    1.0.10
networkx                      3.4.2
nltk                          3.9.1
numba                         0.60.0
numpy                         2.0.2
omegaconf                     2.3.0
onnxruntime                   1.20.0
openpyxl                      3.1.5
optuna                        4.1.0
packaging                     24.2
pandas                        2.2.3
pillow                        11.0.0
pip                           24.3.1
platformdirs                  4.3.6
pooch                         1.8.2
preshed                       3.0.9
primepy                       1.3
propcache                     0.2.0
protobuf                      5.28.3
pyannote-audio                3.3.2
pyannote-core                 5.0.0
pyannote-database             5.1.0
pyannote-metrics              3.2.1
pyannote-pipeline             3.0.1
pycparser                     2.22
pydantic                      2.9.2
pydantic-core                 2.23.4
pygments                      2.18.0
pyparsing                     3.2.0
pyreadline3                   3.5.4
pysubs2                       1.7.3
python-dateutil               2.9.0.post0
pytorch-lightning             2.4.0
pytorch-metric-learning       2.7.0
pytz                          2024.2
pyyaml                        6.0.2
regex                         2024.11.6
requests                      2.32.3
rich                          13.9.4
rtoml                         0.11.0
ruamel-yaml                   0.18.6
ruamel-yaml-clib              0.2.12
sacremoses                    0.0.53
safetensors                   0.4.5
scikit-learn                  1.5.2
scipy                         1.14.1
semver                        3.0.2
sentencepiece                 0.2.0
setuptools                    75.5.0
shellingham                   1.5.4
six                           1.16.0
smart-open                    7.0.5
sortedcontainers              2.4.0
soundfile                     0.12.1
soxr                          0.5.0.post1
spacy                         3.8.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
speechbrain                   1.0.2
sqlalchemy                    2.0.36
srsly                         2.4.8
sympy                         1.13.1
tabulate                      0.9.0
tensorboardx                  2.6.2.2
thinc                         8.3.2
threadpoolctl                 3.5.0
tokenizers                    0.20.3
tomlkit                       0.13.2
torch                         2.5.1
torch-audiomentations         0.11.1
torch-pitch-shift             1.2.5
torchaudio                    2.5.1
torchmetrics                  1.6.0
tqdm                          4.67.0
transformers                  4.46.3
typer                         0.13.1
typing-extensions             4.12.2
tzdata                        2024.2
urllib3                       2.2.3
wasabi                        1.1.3
weasel                        0.4.1
whisperx-numpy2-compatibility 0.1.0
wrapt                         1.16.0
yarl                          1.17.2
yt-dlp                        2024.11.18


atlaste · 2024-11-27T09:49:35Z

I'm having the same issue with the Dutch models. I've now tried on 3 different machines to rule out a problem with one particular installation.

The event viewer shows:

Faulting application name: python.exe, version: 3.12.150.1013, time stamp: 0x651ac086
Faulting module name: cy.cp312-win_amd64.pyd, version: 0.0.0.0, time stamp: 0x66e370c9
Exception code: 0xc0000005
Fault offset: 0x00000000000964ca
Faulting process id: 0x0x12DCC
Faulting application start time: 0x0x1DB40B111D4A7C6
Faulting application path: C:\Python312\python.exe
Faulting module path: C:\Python312\Lib\site-packages\blis\cy.cp312-win_amd64.pyd
Report Id: 2ae974e4-03be-4018-b6c8-c278468d4b7c
Faulting package full name:
Faulting package-relative application ID:

I'm not sure how to get a proper stack trace. The fault offset is the same every time.
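One option for getting at least a Python-level traceback is the standard library's `faulthandler` module, which dumps the Python stack of each thread when the interpreter hits a fatal error. This is a sketch, not a confirmed diagnosis path for this bug: the helper name `enable_crash_traceback` and the log path are assumptions, and the output shows Python frames only, not the native frames inside the blis module.

```python
import faulthandler


def enable_crash_traceback(log_path: str):
    """Write Python tracebacks for all threads to `log_path` on a fatal error.

    Hypothetical helper. The returned file object must stay open for as long
    as the handler is enabled, so the caller should keep a reference to it.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling this at the top of the script, before loading the model, should record which Python line was executing at the time of the crash. The same handler can be switched on without code changes by running `python -X faulthandler script.py`, which writes to stderr instead.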

jonathanfox5 changed the title ~~Segmentation Fault Windows (Processing on CPU)~~ Segmentation Fault when running spacy.language.Language (Windows, Processing on CPU) Nov 20, 2024

jonathanfox5 mentioned this issue Nov 20, 2024

Random Lemmatiser Crashes on Windows jonathanfox5/gogadget#2

Closed

jonathanfox5 changed the title ~~Segmentation Fault when running spacy.language.Language (Windows, Processing on CPU)~~ Segmentation Fault when running lemmatisation (Windows, Processing on CPU) Nov 20, 2024

jonathanfox5 changed the title ~~Segmentation Fault when running lemmatisation (Windows, Processing on CPU)~~ [v3.8.2] Segmentation Fault when running lemmatisation (Windows) Nov 20, 2024
