refactor span assignment from cas to doc, exclude specified labels #136

Merged · 38 commits · Aug 29, 2023
Commits (38)
a7875ed
refactor span assignment from cas to doc, exclude specified labels
iulusoy Aug 18, 2023
073051a
adjust tests for discarded labels and missing split
iulusoy Aug 18, 2023
7f5609d
adjust tests for discarded labels and missing split
iulusoy Aug 18, 2023
b732238
remove obsolete line
iulusoy Aug 18, 2023
a7fa234
Merge branch 'main' into refactor-doc-from-cas
iulusoy Aug 18, 2023
6f670a1
create span lists
iulusoy Aug 21, 2023
0af4f9d
place spans in dataset for spacy
iulusoy Aug 22, 2023
c339f23
added explanation for empty span warning
GwydionJon Aug 23, 2023
4f0f2d9
passed merge dict and task to _merge_span_categories
GwydionJon Aug 23, 2023
b7384e7
added merge dict example to spacy model notebook
GwydionJon Aug 23, 2023
7e67b97
changed visualize data function name
GwydionJon Aug 23, 2023
f173143
added kat5 implizit to task 5
GwydionJon Aug 23, 2023
d270bf5
merge major restructure of data flow for test/train split spacy
iulusoy Aug 25, 2023
616e0dd
major restructure of data flow for test/train split spacy set default…
iulusoy Aug 25, 2023
7943932
todo for task and data source information
iulusoy Aug 25, 2023
4e7b1d7
update test for merge dict changes
iulusoy Aug 25, 2023
9916fd5
make SpacyDataHandler methods static
iulusoy Aug 25, 2023
7697ea2
removed removing of kat5 forderung implizit from _return_span_analyzer
GwydionJon Aug 25, 2023
4c3a77c
Merge branch 'refactor-doc-from-cas' of https://github.com/ssciwr/mor…
GwydionJon Aug 25, 2023
87870e1
keep instance of tdh class
iulusoy Aug 25, 2023
130015f
add test for cas_to_doc
iulusoy Aug 25, 2023
b3ab215
fixed merge issue
GwydionJon Aug 25, 2023
21cb153
first cleanups
iulusoy Aug 25, 2023
66773dc
fix test for array size
iulusoy Aug 25, 2023
104d83f
refactor assign span
iulusoy Aug 28, 2023
5b3a6e4
simplify docbin from dataset
iulusoy Aug 28, 2023
fe99c3d
fix test fluke with too small test data
iulusoy Aug 28, 2023
95ace29
pass column names to docbin generation
iulusoy Aug 28, 2023
6208631
pass column names to docbin generation
iulusoy Aug 28, 2023
fae06dc
get rid of obsolete path type conversion
iulusoy Aug 28, 2023
1d47f7c
add selected labels, task and filenames into dataset description
iulusoy Aug 28, 2023
97a241d
add selected labels, task and filenames into dataset description
iulusoy Aug 28, 2023
44efb35
correct variable passing in tests
iulusoy Aug 28, 2023
88552c8
check task in model is same as in data spacy
iulusoy Aug 29, 2023
4efb4ba
reduce code smells
iulusoy Aug 29, 2023
4eb51d6
remove duplicate method call
iulusoy Aug 29, 2023
a1d0dd2
remove outdated comment
iulusoy Aug 29, 2023
b7be1c7
update transformers notebook for new dataflow
iulusoy Aug 29, 2023
2 changes: 2 additions & 0 deletions .gitignore
@@ -133,3 +133,5 @@ data/
.vscode/settings.json
notebooks/my_model/

# keep test data
!/moralization/data
1 change: 0 additions & 1 deletion moralization/analyse.py
@@ -15,7 +15,6 @@ def _return_span_analyzer(doc_dict):
for doc in doc_dict.values():
# doc.spans.pop("paragraphs", None)
doc.spans.pop("KOMMENTAR", None)
doc.spans.pop("KAT5-Forderung implizit", None)
doc_list.append(doc)

return SpanAnalyzer(doc_list)
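
For context on the deleted line: `doc.spans` behaves like a dictionary of span groups, so `pop(key, None)` silently drops a category before the docs reach the `SpanAnalyzer`. A minimal illustration with a made-up span group:

import spacy

nlp = spacy.blank("de")
doc = nlp("Ein kurzer Beispieltext.")
doc.spans["KOMMENTAR"] = [doc.char_span(0, 3, label="KOMMENTAR")]

# pop(key, None) removes the group if present and is a no-op otherwise,
# so the excluded category never reaches the analyzer.
doc.spans.pop("KOMMENTAR", None)
assert "KOMMENTAR" not in doc.spans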


40,586 changes: 40,586 additions & 0 deletions moralization/data/large_input_data/Kommentare-pos-RR-neu-optimiert-CK.xmi

Large diffs are not rendered by default.

238 changes: 130 additions & 108 deletions moralization/data_manager.py

Large diffs are not rendered by default.

279 changes: 150 additions & 129 deletions moralization/input_data.py
@@ -158,115 +158,126 @@
"Protagonistinnen3": "KAT3-own/other",
"KommunikativeFunktion": "KAT4-Kommunikative Funktion",
"Forderung": "KAT5-Forderung explizit",
# "KAT5Ausformulierung": "KAT5-Forderung implizit",
"KAT5Ausformulierung": "KAT5-Forderung implizit",
# "Kommentar": "KOMMENTAR",
}

nlp = spacy_load_model(language_model)
doc = nlp(cas.sofa_string)

doc_train = nlp(cas.sofa_string)
doc_test = nlp(cas.sofa_string)

# add original cassis sentence as paragraph span
# initialize the SpanGroup objects
doc.spans["sc"] = []
doc.spans["paragraphs"] = []
for cat in map_expressions.values():
doc.spans[cat] = []

# now put the paragraphs (instances/segments) into the SpanGroup "paragraphs"
# these are defined as cas sentences in the input
sentence_type = ts.get_type(
"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
)
paragraph_list = cas.select(sentence_type.name)
doc = InputOutput._get_paragraphs(doc, paragraph_list)

# initialize all span categories
for doc_object in [doc, doc_train, doc_test]:
doc_object.spans["sc"] = []
doc_object.spans["paragraphs"] = []
for cat in map_expressions.values():
doc_object.spans[cat] = []

paragraph_list = cas.select(sentence_type.name)
for paragraph in paragraph_list:
doc_object.spans["paragraphs"].append(
doc_object.char_span(
paragraph.begin,
paragraph.end,
label="paragraph",
)
)
# now put the different categories of the custom spans (i.e. Kat1, etc.) into
# SpanGroups
span_type = ts.get_type("custom.Span")

span_list = cas.select(span_type.name)

doc, doc_train, doc_test = InputOutput._split_train_test(
doc, doc_train, doc_test, span_list, map_expressions
)

return doc, doc_train, doc_test
# now assign the spans and labels in the doc object from the cas object
doc = InputOutput._assign_span_labels(doc, span_list, map_expressions)
return doc

@staticmethod
def _split_train_test(doc, doc_train, doc_test, span_list, map_expressions):
# every n-th entry is put as a test value
n_test = 5
n_start = 0
def _get_paragraphs(doc, paragraph_list):
# add original cassis sentence as paragraph span
for paragraph in paragraph_list:
doc.spans["paragraphs"].append(
doc.char_span(
paragraph.begin,
paragraph.end,
label="paragraph",
)
)
return doc
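
As a reading aid, this is roughly how a `cas` object with `.begin`/`.end` character offsets is obtained before `_get_paragraphs` runs (a sketch using the dkpro-cassis API; the file names are placeholders):

from cassis import load_typesystem, load_cas_from_xmi

with open("TypeSystem.xml", "rb") as f:      # placeholder file name
    ts = load_typesystem(f)
with open("example.xmi", "rb") as f:         # placeholder file name
    cas = load_cas_from_xmi(f, typesystem=ts)

sentence_type = ts.get_type(
    "de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
)
# each annotation carries character offsets via .begin and .end
paragraph_list = cas.select(sentence_type.name)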

@staticmethod
def _assign_span_labels(doc, span_list, map_expressions):
# put the custom spans into the categories
# we also need to delete "Moralisierung" and "Keine Moralisierung"
labels_to_delete = ["Keine Moralisierung", "Moralisierung"]
for span in span_list:
for cat_old, cat_new in map_expressions.items():
# not all of these categories have values in every span.
if span[cat_old]:
if span[cat_old] and span[cat_old] not in labels_to_delete:
# we need to attach each span category on its own, as well as all together in "sc"

char_span = doc.char_span(
span.begin,
span.end,
label=span[cat_old],
char_span = InputOutput._get_char_span(cat_old, doc, span)
doc = InputOutput._append_char_span(
doc, cat_new, char_span, span, cat_old
)
if char_span:
doc.spans[cat_new].append(char_span)
doc.spans["sc"].append(char_span)
n_start = n_start + 1

if n_start % n_test != 0:
char_span_train = doc_train.char_span(
span.begin,
span.end,
label=span[cat_old],
)
doc_train.spans[cat_new].append(char_span_train)
doc_train.spans["sc"].append(char_span_train)
else:
char_span_test = doc_test.char_span(
span.begin,
span.end,
label=span[cat_old],
)
doc_test.spans[cat_new].append(char_span_test)
doc_test.spans["sc"].append(char_span_test)

# char_span returns None when the given indices do not match a token begin and end.
# e.g. ".Ich" instead of ". Ich"
elif char_span is None:
logging_warning = f"The char span for {span.get_covered_text()} ({span}) returned None.\n"
logging_warning += (
"It might be due to a mismatch between char indices. \n"
)
if logging.root.level > logging.DEBUG:
logging_warning += "Skipping span! Enable Debug Logging for more information."

logging.warning(logging_warning)
logging.debug(
f"""Token should be: \n \t'{span.get_covered_text()}', but is '{
doc.char_span(
span.begin,
span.end,
alignment_mode="expand",
label=span[cat_old],

)}'\n"""
)

# create test and train set:

return doc, doc_train, doc_test
return doc

@staticmethod
def _get_char_span(cat_old, doc, span):
# KAT5 implicit has a long string inside the label;
# we need to delete this string and use the label "implizit" instead
if cat_old == "KAT5Ausformulierung":
char_span = doc.char_span(
span.begin,
span.end,
label="implizit",
)
else:
char_span = doc.char_span(
span.begin,
span.end,
label=span[cat_old],
)
return char_span

@staticmethod
def _append_char_span(doc, cat_new, char_span, span, cat_old):
if char_span:
doc.spans[cat_new].append(char_span)
doc.spans["sc"].append(char_span)
# char_span returns None when the given indices do not match a token begin and end,
# e.g. ".Ich" instead of ". Ich".
# The problem stems from a mismatch between spacy token beginnings and cassis token beginnings,
# likely because spacy tokenizes on whitespace while cassis tokenizes on punctuation.
# This leads to a mismatch between the token indices:
# where spacy sees ".Ich" as a single token,
# cassis returns only the indices of "I" and "h" as start and end points,
# so spacy complains that the start index is not actually the beginning of a token.
# We could try to fix this by reducing the index by 1 and checking whether the token is complete.
# However, this would yield tokens that are not actual words and
# thus are not useful for training.
# Instead, print a warning that this span cannot be used.
elif char_span is None:
InputOutput._warn_empty_span(doc, span, cat_old)
return doc

@staticmethod
def _warn_empty_span(doc, span, cat_old):
logging_warning = (
f"The char span for {span.get_covered_text()} ({span}) returned None.\n"
)
logging_warning += "It might be due to a mismatch between char indices. \n"
if logging.root.level > logging.DEBUG:
logging_warning += (
"Skipping span! Enable Debug Logging for more information."
)
logging.warning(logging_warning)
logging.debug(
f"""Token should be: \n \t'{span.get_covered_text()}', but is '{
doc.char_span(
span.begin,
span.end,
alignment_mode="expand",
label=span[cat_old],
)}'\n"""
)
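
The following self-contained snippet reproduces the behavior the warning describes: spaCy's `char_span` returns `None` when character offsets do not coincide with token boundaries, and `alignment_mode="expand"` (used in the debug message above) snaps to the enclosing tokens instead. The sentence is made up for illustration:

import spacy

nlp = spacy.blank("de")
doc = nlp("Das ist ein Test.")

# offsets 1..7 cover "as ist", which starts mid-token -> None
assert doc.char_span(1, 7) is None

# with alignment_mode="expand", spaCy widens to full tokens -> "Das ist"
print(doc.char_span(1, 7, alignment_mode="expand"))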

@staticmethod
def files_to_docs(
data_files: List or str, ts: object, language_model: str = "de_core_news_sm"
data_files: List, ts: object, language_model: str = "de_core_news_sm"
):
"""

@@ -280,36 +291,39 @@

"""
doc_dict = {}
train_dict = {}
test_dict = {}

for file in data_files:
logging.info(f"Reading ./{file}")
try:
cas, file_type = InputOutput.read_cas_file(file, ts)
doc, doc_train, doc_test = InputOutput.cas_to_doc(
cas, ts, language_model
)
cas, _ = InputOutput.read_cas_file(file, ts)
doc = InputOutput.cas_to_doc(cas, ts, language_model)
doc_dict[file.stem] = doc
train_dict[file.stem] = doc_train
test_dict[file.stem] = doc_test

except XMLSyntaxError as e:
logging.warning(
f"WARNING: skipping file '{file}' due to XMLSyntaxError: {e}"
)

return doc_dict, train_dict, test_dict
return doc_dict

@staticmethod
def _merge_span_categories(doc_dict, merge_dict=None):
def _merge_span_categories(doc_dict, merge_dict=None, task=None):
"""Take the new_dict_cat dict and add its key as a main_cat to data_dict.
The values are the total sub_dict_entries of the given list.

Args:
doc_dict(dict: doc): The provided doc dict.
new_dict_cat(dict): map new category to list of existing_categories.

doc_dict (dict): The provided doc dict.
merge_dict (dict, optional): map a new category to a list of existing categories, e.g.:
merge_dict = {
"task1": ["KAT1-Moralisierendes Segment"],
"task2": ["KAT2-Moralwerte", "KAT2-Subjektive Ausdrücke"],
"task3": ["KAT3-Rolle", "KAT3-Gruppe", "KAT3-own/other"],
"task4": ["KAT4-Kommunikative Funktion"],
"task5": ["KAT5-Forderung explizit", "KAT5-Forderung implizit"],
}
Defaults to None.
task (str, optional): The task for which the labels are selected.
Defaults to None, in which case "task1" is used.
Return:
dict: The data_dict with new span categories.
"""
@@ -319,49 +333,56 @@
"task2": ["KAT2-Moralwerte", "KAT2-Subjektive Ausdrücke"],
"task3": ["KAT3-Rolle", "KAT3-Gruppe", "KAT3-own/other"],
"task4": ["KAT4-Kommunikative Funktion"],
"task5": ["KAT5-Forderung explizit"],
"task5": ["KAT5-Forderung explizit", "KAT5-Forderung implizit"],
}
if task is None:
task = "task1"

if task not in merge_dict.keys():
raise KeyError(
f"{task} not in merge_dict. Please provide a valid task or include the given task in the merge dict."
)

# now we only need to merge categories for the given task.
merge_categories = merge_dict[task]

for file in doc_dict.keys():
# initialize new span_groups
for cat in merge_dict.keys():
doc_dict[file].spans[cat] = []

for new_main_cat, new_cat_entries in merge_dict.items():
if new_cat_entries == "all":
for main_cat in list(doc_dict[file].spans.keys()):
doc_dict[file].spans[new_main_cat].extend(
doc_dict[file].spans[main_cat]
)
else:
for old_main_cat in new_cat_entries:
doc_dict[file].spans[new_main_cat].extend(
doc_dict[file].spans[old_main_cat]
)
# initialize new span_group
doc_dict[file].spans[task] = []

for old_main_cat in merge_categories:
try:
doc_dict[file].spans[task].extend(
doc_dict[file].spans[old_main_cat]
)

except KeyError:
raise KeyError(
f"{old_main_cat} not found in doc_dict[file].spans which"
+ f" has {list(doc_dict[file].spans.keys())} as keys."
)
return doc_dict
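
A small sketch of what `_merge_span_categories` does for one file (hypothetical `doc_dict`; defaults as in the diff above): the span groups listed under the chosen task are concatenated into one new group named after the task:

merge_dict = {
    "task5": ["KAT5-Forderung explizit", "KAT5-Forderung implizit"],
}
doc_dict = InputOutput._merge_span_categories(
    doc_dict, merge_dict=merge_dict, task="task5"
)
# doc_dict[file].spans["task5"] now holds the union of both KAT5 groups.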

@staticmethod
def read_data(dir: str, language_model: str = "de_core_news_sm"):
def read_data(
dir: str, language_model: str = "de_core_news_sm", merge_dict=None, task=None
):
"""Convenience method to handle input reading in one go.

Args:
dir (str): Path to the data directory.
language_model (str, optional): Language model of the corpus that is being read.
Defaults to "de_core_news_sm" (German).

dir (str): Path to the data directory.
language_model (str, optional): Language model of the corpus that is being read.
Defaults to "de_core_news_sm" (German).
merge_dict (dict, optional): map a new category to a list of existing categories.
task (str, optional): which task to use in the merge. Defaults to None.
Returns:
doc_dict (dict): Dictionary with all the available data in one place.
train_dict (dict): Dictionary with only the spans that are used for training.
test_dict (dict): Dictionary with only the spans that are used for testing.
"""
data_files, ts_file = InputOutput.get_multiple_input(dir)
# read in the ts
ts = InputOutput.read_typesystem(ts_file)
doc_dict, train_dict, test_dict = InputOutput.files_to_docs(
doc_dict = InputOutput.files_to_docs(
data_files, ts, language_model=language_model
)

for dict_ in [doc_dict, train_dict, test_dict]:
dict_ = InputOutput._merge_span_categories(dict_)

return doc_dict, train_dict, test_dict
doc_dict = InputOutput._merge_span_categories(doc_dict, merge_dict, task)
return doc_dict
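
Downstream code then works with the merged task group on each Doc, for example (a hedged sketch; the span group name follows the defaults above, the path is a placeholder):

doc_dict = InputOutput.read_data("path/to/data", task="task1")
for name, doc in doc_dict.items():
    # every selected span carries the label assigned in _assign_span_labels
    for span in doc.spans["task1"]:
        print(name, span.text, span.label_)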