Add relation extraction #858
base: main
Conversation
Co-authored-by: Yoav Katz <[email protected]>
src/unitxt/metrics.py
Outdated
    def get_element_representation(self, element, additional_input):
        return str(element)


def normalize_answer(s):
This is not needed. It's not called by CustomF1. It's a global scope method used only by TokenOverlap.
src/unitxt/metrics.py
Outdated
@@ -1769,8 +1769,11 @@ def compute(
        references: List[List[Any]],
        predictions: List[Any],
        task_data: List[Dict],
        reference_extraction_func: Callable[
This is not needed; we always take the first reference in NER (we assume there is only one reference).
In unitxt / FM-eval, a reference is a full "correct answer". Note that in this case an answer is a List[Tuple[str,str,str]].
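For illustration, a tiny sketch of the data shapes this implies; the (subject, relation, object) ordering and the example values are my assumptions, not taken from the PR:

```python
from typing import List, Tuple

# One full "correct answer" is a list of triples, e.g. (subject, relation, object).
Answer = List[Tuple[str, str, str]]

# A single instance carries a list of such references; we assume only one.
references: List[Answer] = [
    [("John", "employedBy", "Acme"), ("Acme", "locatedIn", "Boston")],
]

gold: Answer = references[0]  # take the first (and only) reference
```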
src/unitxt/metrics.py
Outdated
@@ -1906,6 +1909,34 @@ def lower(text):
    return white_space_fix(remove_articles(remove_punc(lower(s))))


class RelationExtraction(CustomF1):
    prediction_type = "List[Tuple[str,str]]"
prediction_type = "List[Tuple[str,str]]" | |
prediction_type = "List[Tuple[str,str,str]]" |
done
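For reference, a minimal sketch of the class with the corrected prediction type. Only `get_element_representation` appears in this diff; `get_element_group` and grouping by the relation element are my assumptions about how the CustomF1 hooks would be used here:

```python
from unitxt.metrics import CustomF1


class RelationExtraction(CustomF1):
    prediction_type = "List[Tuple[str,str,str]]"

    def get_element_group(self, element, additional_input):
        # assumption: group triples by their relation type for per-relation F1
        return element[1]

    def get_element_representation(self, element, additional_input):
        # compare whole (subject, relation, object) triples as strings
        return str(element)
```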
Hi. I added my comments. I think you should create a card that uses the tasks, loads the raw data from the file, and converts it to the format required by the task.
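A rough sketch of what such a card could look like. The loader, operator, file path, column names, and catalog names here (LoadCSV, RenameFields, "cards.relation_extraction", "templates.relation_extraction.all") are placeholders/assumptions to be adapted to the actual raw data:

```python
from unitxt import add_to_catalog
from unitxt.blocks import TaskCard
from unitxt.loaders import LoadCSV
from unitxt.operators import RenameFields

card = TaskCard(
    # load the raw data from a file (path and format are hypothetical)
    loader=LoadCSV(files={"train": "data/relations_train.csv"}),
    # convert the raw columns into the fields required by the task
    preprocess_steps=[
        RenameFields(field_to_field={"sentence": "text"}),
    ],
    task="tasks.relation_extraction",
    templates="templates.relation_extraction.all",
)

add_to_catalog(card, "cards.relation_extraction", overwrite=True)
```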
prepare/tasks/relation_extraction.py
Outdated
FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "entity_surface_form1": "List[str]",
I think we need a better name; it took me some time to understand what it is. Maybe "relation_entity_subjects" / "relation_entity_objects", or something more explicit.
done
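Just to make the rename concrete, this is roughly how the outputs block could read; the "relations" output field and the metric catalog name are my own placeholders, not from the PR:

```python
from unitxt.blocks import FormTask

FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "relation_entity_subjects": "List[str]",
        "relations": "List[str]",  # placeholder for the relation label of each pair
        "relation_entity_objects": "List[str]",
    },
    metrics=["metrics.relation_extraction"],  # placeholder catalog name
)
```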
from unitxt import add_to_catalog

sys.path.append("/Users/pklocek/fm_eval_workspace/unitxt/src/unitxt/metrics.py")
I think we do not want this line?
Right. Why was this added?
Since this is an important NLP task, I suggest we try to get it merged ASAP. My suggestion is to follow the conventions and naming in the TACRED dataset as a representative of the jargon consensus. The simple naming for the task output fields is:

subjects: List[str]
relations: List[str]
objects: List[str]

The more verbose version:

subjects_mentions: List[str]
relations_types: List[str]
objects_mentions: List[str]

I personally prefer the simple one.

Now to the second observation: there are really two types of tasks: (1) produce only mentions and relations; (2) produce mentions with their exact location in the text. Both have different metrics and different use cases IMO. So my practical suggestion here is to actually have two different tasks (sketched below). The second, of course, should additionally have:

subjects_starts: List[int]
subjects_ends: List[int]
objects_starts: List[int]
objects_ends: List[int]
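To make the proposal concrete, a sketch of the two task definitions using FormTask as elsewhere in this PR; the metric catalog names are placeholders, and the field set for the positional variant follows the lists above:

```python
from unitxt import add_to_catalog
from unitxt.blocks import FormTask

# (1) produce only mentions and relations
add_to_catalog(
    FormTask(
        inputs={"text": "str", "relation_types": "List[str]"},
        outputs={
            "subjects": "List[str]",
            "relations": "List[str]",
            "objects": "List[str]",
        },
        metrics=["metrics.relation_extraction"],  # placeholder
    ),
    "tasks.relation_extraction",
    overwrite=True,
)

# (2) produce mentions with their exact location in the text
add_to_catalog(
    FormTask(
        inputs={"text": "str", "relation_types": "List[str]"},
        outputs={
            "subjects": "List[str]",
            "subjects_starts": "List[int]",
            "subjects_ends": "List[int]",
            "relations": "List[str]",
            "objects": "List[str]",
            "objects_starts": "List[int]",
            "objects_ends": "List[int]",
        },
        metrics=["metrics.relation_extraction_with_positions"],  # placeholder
    ),
    "tasks.relation_extraction_with_positions",
    overwrite=True,
)
```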
I agree that the short names are better, and that there are two tasks (with position and without position). They differ not only in the input but also in the prediction type (e.g., the positions need to be checked). Initially we can support only tasks.relation_extraction. Even if the data comes with positions, it should be converted to lists of strings in the pre-processing phase of the card (and not in the template, as is done today in NER).
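Illustrative only: the kind of conversion the card's pre-processing phase would do when the raw data comes with character offsets. The field names are hypothetical, and in unitxt this logic would be wrapped inside a preprocessing operator of the card:

```python
from typing import Dict


def spans_to_surface_forms(instance: Dict) -> Dict:
    """Turn (start, end) character spans into the plain strings that the
    position-less tasks.relation_extraction task expects."""
    text = instance["text"]
    instance["subjects"] = [
        text[start:end]
        for start, end in zip(instance["subjects_starts"], instance["subjects_ends"])
    ]
    instance["objects"] = [
        text[start:end]
        for start, end in zip(instance["objects_starts"], instance["objects_ends"])
    ]
    return instance
```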
I agree with you, I think this is a good way to implement it. Additionally, I am currently implementing a task/metric for datasets that do not have a specified, named relation such as (obj1, relation, obj2). Instead, for a tuple (obj1, obj2, ..., obj_n), we consider being in the same tuple to be the relation itself.
@pklpriv Can you explain more about the n-ary tuple with an unnamed relation? Can you give an example of one input and the required output?
I will allow myself to cite Yoav: "As of December 31, 2011, we had 4,260 employees of whom 2,155, or 51%, are employed in the U.S.; 1,165, or 27%, are employed in Europe; 615, or 14%, are employed in Asia and 325, or 8%, are employed in the Middle East." This is the expected output:

[{"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"}, …

As you see, there is no object that represents the relation itself (as opposed to, e.g., (John, employedBy, Hannah), where 'employedBy' is the relation stated as an object). Instead, the relation is defined by the sentence itself.
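A sketch (not from the PR) of how such n-ary, unnamed-relation answers could be represented and canonicalized for a CustomF1-style metric: each extracted "relation" is just a mapping from role to mention, and membership in the same mapping is the relation:

```python
from typing import Dict, List

# One answer is a list of role -> mention mappings, one per extracted tuple.
Answer = List[Dict[str, str]]

gold: Answer = [
    {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
    # ... one dict per tuple expressed by the sentence
]


def element_representation(element: Dict[str, str]) -> str:
    # sort the items so two mappings with the same role/mention pairs compare equal
    return str(sorted(element.items()))
```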
Adding support for the relation-extraction task.