
refactor span assignment from cas to doc, exclude specified labels #136

Merged · iulusoy merged 38 commits into main from refactor-doc-from-cas · Aug 29, 2023

Conversation

@iulusoy (Member) commented Aug 18, 2023

spaCy uses DocBin files for training. We will use the train/test split from the Dataset object (method A in the comment below).

A.
The test/train export could be done later. In this case, we would keep the single doc object and not split into test/train during the initial stages of the DataManager. We could instead use the Dataset that is split at the end of DataManager initialization and create DocBins from the train and test columns. With this, we would only have the specified task in the data, so the logic for spaCy would be quite different, but perhaps cleaner: both spaCy and transformers would use the same pipeline for the data, and an exact comparison of the training runs could be carried out, since they would be done with exactly the same data. It is more difficult because in the Dataset object we only have access to the token id, not the start/end of a span. These offsets could be added to the Dataset object, though, which would then also allow others to use the same data in spaCy (see the token-level sketch after the example below).

In this case we would use something like

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None if the character offsets do not
        # align with token boundaries, so skip such spans
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
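
Since the Dataset only exposes token ids, the character offsets used above would not be directly available. A token-level variant could bypass the tokenizer entirely. This is a minimal sketch, assuming the Dataset could yield pre-tokenized words plus (start_token, end_token, label) triples — that data shape is an assumption for illustration, not something the Dataset provides today:

import spacy
from spacy.tokens import Doc, DocBin, Span

# hypothetical token-level training data: words plus
# (start_token, end_token_exclusive, label) triples
token_training_data = [
    (["Tokyo", "Tower", "is", "333m", "tall", "."], [(0, 2, "BUILDING")]),
]

nlp = spacy.blank("en")
db = DocBin()
for words, annotations in token_training_data:
    # build the Doc directly from the tokens, no tokenizer involved
    doc = Doc(nlp.vocab, words=words)
    # Span takes token indices, so no character offsets are needed
    doc.ents = [Span(doc, start, end, label=label) for start, end, label in annotations]
    db.add(doc)
db.to_disk("./train_token_level.spacy")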

B.
Manual splitting of the train/test data after assigning the spans in InputData, similar to how it was done before, except that we now have a separate loop for this. Advantage: the distribution of the labels could be handled more carefully, so that each set contains a similar fraction of each label (see the stratified-split sketch below).
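
A simple way to keep the label fractions comparable is a stratified split. The sketch below is illustrative only: the per-doc label key (e.g. the most frequent span label in a doc) and the function name are made-up assumptions, not existing code:

import random
from collections import defaultdict

def stratified_split(docs_with_labels, test_fraction=0.2, seed=42):
    # docs_with_labels: list of (doc, label) pairs, where 'label' is a
    # hypothetical per-doc key, e.g. the doc's most frequent span label
    random.seed(seed)
    by_label = defaultdict(list)
    for doc, label in docs_with_labels:
        by_label[label].append(doc)
    train, test = [], []
    # shuffle and split each label group separately so every label
    # keeps roughly the same train/test proportion
    for docs in by_label.values():
        random.shuffle(docs)
        n_test = max(1, int(len(docs) * test_fraction))
        test.extend(docs[:n_test])
        train.extend(docs[n_test:])
    return train, test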

@iulusoy iulusoy requested a review from GwydionJon August 18, 2023 11:36
codecov bot commented Aug 18, 2023

Codecov Report

Merging #136 (b7be1c7) into main (442f36d) will increase coverage by 0.66%.
The diff coverage is 98.25%.

@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
+ Coverage   95.97%   96.64%   +0.66%     
==========================================
  Files          22       22              
  Lines        1915     2059     +144     
==========================================
+ Hits         1838     1990     +152     
+ Misses         77       69       -8     
| Files Changed | Coverage Δ |
| --- | --- |
| moralization/analyse.py | 100.00% <ø> (ø) |
| moralization/input_data.py | 97.88% <95.00%> (+5.08%) ⬆️ |
| moralization/spacy_data_handler.py | 96.19% <95.89%> (-3.81%) ⬇️ |
| moralization/data_manager.py | 97.33% <98.00%> (+2.11%) ⬆️ |
| moralization/plot.py | 85.29% <100.00%> (-0.72%) ⬇️ |
| moralization/spacy_model_manager.py | 99.17% <100.00%> (+0.02%) ⬆️ |
| moralization/tests/conftest.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_analyse.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_data_manager.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_input_data.py | 100.00% <100.00%> (ø) |

... and 5 more

sonarcloud bot commented Aug 29, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
Duplication: 0.0%

@iulusoy iulusoy merged commit cce0d14 into main Aug 29, 2023
6 checks passed
@iulusoy iulusoy deleted the refactor-doc-from-cas branch August 29, 2023 10:13