refactor span assignment from cas to doc, exclude specified labels #136
Merged
Codecov Report
@@            Coverage Diff             @@
##             main     #136      +/-   ##
==========================================
+ Coverage   95.97%   96.64%   +0.66%
==========================================
  Files          22       22
  Lines        1915     2059     +144
==========================================
+ Hits         1838     1990     +152
+ Misses         77       69       -8
…alization into refactor-doc-from-cas
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
InputOutput: helper methods; disentangle train/test generation from span assignment in the doc objects
DataManager: exclude Keine Moralisierung and Moralisierung from dataset #127
closes "random split into test and train set in for spaCy training" #133
closes "does Keine Moralisierung as label mess up the spacy training?" #110
closes "refactor TransformersDataHandler data passing and instantiation" #138
closes "confusing method name" #137
closes "include implicit in task5" #131
closes "example of passing a merge_dict" #130

spaCy uses DocBin files for training. We will use the train/test split from the Dataset object (method A in the comment below).

A.
The test/train export could be done later. In this case, we would keep a single doc object and not split it into test/train during the initial stages of the DataManager. We could use the Dataset that is split at the end of DataManager initialization and create DocBins from its train and test columns. With this approach the data would contain only the specified task, so the logic for spaCy would be quite different, but maybe cleaner.
Cleaner because both spaCy and transformers would use the same data pipeline.
Cleaner because an exact comparison of the two trainings could be carried out (they would run on exactly the same data).
More difficult because the Dataset object only gives access to the token ids, not to the start/end of a span. These could be added to the Dataset object, though, which would also allow others to use the same data in spaCy.
In this case we would use something like
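One part of option A that can be sketched is the difficulty noted above: going from token-level labels back to span start/end offsets. The following is a minimal sketch under stated assumptions, not the project's implementation. The function name `token_labels_to_char_spans`, the outside label `"O"`, and the convention that tokens are joined by a single space are all hypothetical choices made for illustration; the actual Dataset columns may look different.

```python
from typing import List, Tuple


def token_labels_to_char_spans(
    tokens: List[str], labels: List[str], outside_label: str = "O"
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Rebuild the text and (start_char, end_char, label) spans from tokens.

    Assumptions (hypothetical, for illustration only):
    - one plain label per token, no B-/I- prefixes
    - tokens are separated by exactly one space in the rebuilt text
    - adjacent tokens with the same label belong to the same span
    """
    spans: List[Tuple[int, int, str]] = []
    current = None  # open span as [start_char, end_char, label]
    offset = 0
    for token, label in zip(tokens, labels):
        start, end = offset, offset + len(token)
        if label != outside_label and current is not None and current[2] == label:
            # Same label as the open span: extend it to cover this token.
            current[1] = end
        else:
            # Close any open span, then open a new one if this token is labelled.
            if current is not None:
                spans.append((current[0], current[1], current[2]))
            current = [start, end, label] if label != outside_label else None
        offset = end + 1  # account for the single separating space
    if current is not None:
        spans.append((current[0], current[1], current[2]))
    return " ".join(tokens), spans
```

Character offsets of this kind are what spaCy's `Doc.char_span` takes as input, so spans recovered this way could then be attached to doc objects and collected in a `DocBin` for each of the train and test splits.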
B.
Manual splitting of the train/test data after assigning the spans in InputData, similar to what was done before, except that we now have a separate loop for this. Advantage: the distribution of the labels could be handled more carefully, so that each set has a similar fraction of each label.
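The label-aware splitting mentioned for option B is usually called a stratified split. Below is a minimal sketch of the idea, assuming each example carries exactly one label; that is a simplification, since a doc with several spans can carry several labels. The function name `stratified_split` and its parameters are hypothetical and not taken from the project.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    examples: Sequence,
    labels: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: int = 42,
) -> Tuple[List[tuple], List[tuple]]:
    """Split (example, label) pairs so each label keeps a similar share per set."""
    # Group the examples by their label.
    by_label = defaultdict(list)
    for example, label in zip(examples, labels):
        by_label[label].append(example)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):  # sorted for a deterministic order
        group = by_label[label]
        rng.shuffle(group)
        # Take the same fraction of every label group for the test set.
        n_test = round(len(group) * test_fraction)
        test.extend((example, label) for example in group[:n_test])
        train.extend((example, label) for example in group[n_test:])
    return train, test
```

If an external dependency is acceptable, scikit-learn's `train_test_split` offers a `stratify` argument that serves the same purpose.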