
refactor span assignment from cas to doc, exclude specified labels #136

Merged · iulusoy merged 38 commits into main from refactor-doc-from-cas · Aug 29, 2023

Conversation

@iulusoy (Member) commented Aug 18, 2023

spaCy uses DocBin files for training. We will use the train/test split from the Dataset object (method A in the comment below).

A.
The test/train export could be done later. In this case, we would keep the single doc object and not split into test/train during the initial stages of the DataManager. We could instead use the Dataset that is split at the end of DataManager initialization and create DocBins from the train and test columns. With this, we would only have the specified task in the data, so the logic for spaCy would be quite different, but perhaps cleaner: both spaCy and transformers would use the same pipeline for the data, and an exact comparison of the training runs could be carried out, since they would be done with exactly the same data. It is more difficult because in the Dataset object we only have access to the token id, not the start/end of a span. These offsets could be added to the Dataset object, though, which would then also allow others to use the same data in spaCy (see the token-level sketch after the example below).

In this case we would use something like

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
training_data = [
  ("Tokyo Tower is 333m tall.", [(0, 11, "BUILDING")]),
]
# the DocBin will store the example documents
db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)
    ents = []
    for start, end, label in annotations:
        # char_span returns None if the character offsets do not
        # align with token boundaries, so skip such spans
        span = doc.char_span(start, end, label=label)
        if span is not None:
            ents.append(span)
    doc.ents = ents
    db.add(doc)
db.to_disk("./train.spacy")
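
Since the Dataset only exposes token ids, the character offsets used above would not be directly available. A token-level variant could bypass the tokenizer entirely. This is a minimal sketch, assuming the Dataset could yield pre-tokenized words plus (start_token, end_token, label) triples — that data shape is an assumption for illustration, not something the Dataset provides today:

import spacy
from spacy.tokens import Doc, DocBin, Span

# hypothetical token-level training data: words plus
# (start_token, end_token_exclusive, label) triples
token_training_data = [
    (["Tokyo", "Tower", "is", "333m", "tall", "."], [(0, 2, "BUILDING")]),
]

nlp = spacy.blank("en")
db = DocBin()
for words, annotations in token_training_data:
    # build the Doc directly from the tokens, no tokenizer involved
    doc = Doc(nlp.vocab, words=words)
    # Span takes token indices, so no character offsets are needed
    doc.ents = [Span(doc, start, end, label=label) for start, end, label in annotations]
    db.add(doc)
db.to_disk("./train_token_level.spacy")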

B.
Manual splitting of the train/test data after assigning the spans in InputData, similar to how it was done before, except that we now have a separate loop for this. Advantage: the distribution of the labels could be handled more carefully, so that each set contains a similar fraction of each label (see the stratified-split sketch below).
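
A simple way to keep the label fractions comparable is a stratified split. The sketch below is illustrative only: the per-doc label key (e.g. the most frequent span label in a doc) and the function name are made-up assumptions, not existing code:

import random
from collections import defaultdict

def stratified_split(docs_with_labels, test_fraction=0.2, seed=42):
    # docs_with_labels: list of (doc, label) pairs, where 'label' is a
    # hypothetical per-doc key, e.g. the doc's most frequent span label
    random.seed(seed)
    by_label = defaultdict(list)
    for doc, label in docs_with_labels:
        by_label[label].append(doc)
    train, test = [], []
    # shuffle and split each label group separately so every label
    # keeps roughly the same train/test proportion
    for docs in by_label.values():
        random.shuffle(docs)
        n_test = max(1, int(len(docs) * test_fraction))
        test.extend(docs[:n_test])
        train.extend(docs[n_test:])
    return train, test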

@iulusoy iulusoy requested a review from GwydionJon August 18, 2023 11:36
codecov bot commented Aug 18, 2023

Codecov Report

Merging #136 (b7be1c7) into main (442f36d) will increase coverage by 0.66%.
The diff coverage is 98.25%.

@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
+ Coverage   95.97%   96.64%   +0.66%     
==========================================
  Files          22       22              
  Lines        1915     2059     +144     
==========================================
+ Hits         1838     1990     +152     
+ Misses         77       69       -8     
| Files Changed | Coverage Δ |
| --- | --- |
| moralization/analyse.py | 100.00% <ø> (ø) |
| moralization/input_data.py | 97.88% <95.00%> (+5.08%) ⬆️ |
| moralization/spacy_data_handler.py | 96.19% <95.89%> (-3.81%) ⬇️ |
| moralization/data_manager.py | 97.33% <98.00%> (+2.11%) ⬆️ |
| moralization/plot.py | 85.29% <100.00%> (-0.72%) ⬇️ |
| moralization/spacy_model_manager.py | 99.17% <100.00%> (+0.02%) ⬆️ |
| moralization/tests/conftest.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_analyse.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_data_manager.py | 100.00% <100.00%> (ø) |
| moralization/tests/test_input_data.py | 100.00% <100.00%> (ø) |

... and 5 more

sonarcloud bot commented Aug 29, 2023

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
Duplication: 0.0%

@iulusoy iulusoy merged commit cce0d14 into main Aug 29, 2023
6 checks passed
@iulusoy iulusoy deleted the refactor-doc-from-cas branch August 29, 2023 10:13