Add relation extraction #858

Draft · wants to merge 31 commits into main

Conversation

@pklpriv (Member) commented May 21, 2024:

Adding support for the relation-extraction task.

@pklpriv marked this pull request as draft on May 21, 2024, 13:27.
def get_element_representation(self, element, additional_input):
    return str(element)


def normalize_answer(s):
Member:
This is not needed: it's not called by CustomF1. It's a global-scope function used only by TokenOverlap.

@@ -1769,8 +1769,11 @@ def compute(
        references: List[List[Any]],
        predictions: List[Any],
        task_data: List[Dict],
        reference_extraction_func: Callable[
Member:

This is not needed; we always take the first reference in NER (we assume there is only one reference). In unitxt / FM-eval, a reference is a full "correct answer". Note that in this case an answer is a List[Tuple[str,str,str]].
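For concreteness, a minimal sketch of what that convention implies; the entity names here are illustrative, not from this PR:

```python
# Illustrative sketch, not code from this PR: a single reference is one full
# "correct answer", i.e. a List[Tuple[str, str, str]] of relation triples.
reference = [
    ("John", "employedBy", "Acme"),        # (subject, relation, object)
    ("Acme", "headquarteredIn", "Berlin"),
]

# references has type List[List[Any]]: one inner list per instance. Under the
# single-reference assumption, the metric only ever needs the first element.
references_for_instance = [reference]
first_and_only = references_for_instance[0]
```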

@@ -1906,6 +1909,34 @@ def lower(text):
    return white_space_fix(remove_articles(remove_punc(lower(s))))


class RelationExtraction(CustomF1):
    prediction_type = "List[Tuple[str,str]]"
Member:

Suggested change:
-    prediction_type = "List[Tuple[str,str]]"
+    prediction_type = "List[Tuple[str,str,str]]"

Member Author:

done

@yoavkatz (Member):

Hi. I added my comments. I think you should create a card that uses the task, loads the raw data from the file, and converts it to the format required by the task.
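A rough sketch of that card-level conversion step; the raw record format is an assumption, and the output field names follow the naming discussed later in this thread:

```python
# Hypothetical sketch of the card's pre-processing: convert one raw record
# (format assumed, not from this PR) into flat fields for the task.
def convert_record(raw: dict) -> dict:
    triples = raw["triples"]  # assumed: list of {"subject", "relation", "object"} dicts
    return {
        "text": raw["text"],
        "relation_types": raw.get("relation_types", []),
        "subjects": [t["subject"] for t in triples],
        "relations": [t["relation"] for t in triples],
        "objects": [t["object"] for t in triples],
    }
```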

FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "entity_surface_form1": "List[str]",
Member:

I think we need a better name; it took me a while to understand what it is. Maybe "relation_entity_subjects" / "relation_entity_objects", or something more explicit.

Member Author:

done

Przemysław Klocek added 2 commits May 23, 2024 14:12

from unitxt import add_to_catalog

sys.path.append("/Users/pklocek/fm_eval_workspace/unitxt/src/unitxt/metrics.py")
Reviewer:

I think we do not want this line?

Member:

Right. Why was this added?

@elronbandel (Member) commented Jun 10, 2024:

Since this is an important NLP task, I suggest we try to get it merged ASAP.

My suggestion is to follow the conventions and naming of the TACRED dataset, as representative of the jargon consensus.

The simple naming for the task output fields is:

subjects: List[str]
relations: List[str]
objects: List[str]

The more verbose version:

subjects_mentions: List[str]
relations_types: List[str]
objects_mentions: List[str]

I personally prefer the simple one.

Now to the second observation: there are really two types of tasks: (1) producing only mentions and relations, and (2) producing mentions with their exact locations in the text. Both have different metrics and different use cases, IMO.

So my practical suggestion here is to actually have two different tasks:

  1. tasks.relation_extraction
  2. tasks.relation_extraction.with_positions

The second, of course, should also include:

subjects_starts: List[int]
subjects_ends: List[int]
objects_starts: List[int]
objects_ends: List[int]

What do you think @pklpriv and @yoavkatz ?
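To make the proposal concrete, here is a sketch of the two task definitions, using the FormTask style already quoted in this PR. The field names follow the proposal above; the import path and metric names are placeholders, not merged code or existing catalog entries:

```python
from unitxt.blocks import FormTask

# Sketch only: tasks.relation_extraction (mentions and relations, no positions).
relation_extraction = FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "subjects": "List[str]",
        "relations": "List[str]",
        "objects": "List[str]",
    },
    metrics=["metrics.relation_extraction"],  # placeholder name
)

# Sketch only: tasks.relation_extraction.with_positions adds character offsets.
relation_extraction_with_positions = FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "subjects": "List[str]",
        "relations": "List[str]",
        "objects": "List[str]",
        "subjects_starts": "List[int]",
        "subjects_ends": "List[int]",
        "objects_starts": "List[int]",
        "objects_ends": "List[int]",
    },
    metrics=["metrics.relation_extraction.with_positions"],  # placeholder name
)
```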

@yoavkatz (Member):

I agree that the short names are better, and that there are two tasks (with and without positions). They differ not only in the input but also in the prediction type (e.g. the positions need to be checked). Initially we can support only tasks.relation_extraction. Even if the data comes with positions, it should be converted to lists of strings in the pre-processing phase of the card (and not in the template, as is done today in NER).
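A minimal sketch of that pre-processing step, assuming the raw data carries character offsets into the text; the raw format and example values are assumptions, not from this PR:

```python
# Hypothetical sketch: strip positions during card pre-processing, so the
# string-only task sees just the surface forms.
def spans_to_strings(text: str, spans: list) -> list:
    # each span is assumed to carry character offsets into `text`
    return [text[s["start"]:s["end"]] for s in spans]

text = "John works for Acme."
subjects = spans_to_strings(text, [{"start": 0, "end": 4}])   # ["John"]
objects = spans_to_strings(text, [{"start": 15, "end": 19}])  # ["Acme"]
```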

@pklpriv (Member Author) commented Jun 12, 2024:

@elronbandel @yoavkatz

I agree with you; I think this is a good way to implement it. Additionally, I am currently implementing a task/metric for datasets that do not have a specified, named relation like (obj1, relation, obj2). Instead, for a tuple (obj1, obj2, ..., obj_n), we consider membership in the same tuple to be the relation itself.

@elronbandel (Member):

@pklpriv Can you explain more about the n-ary tuple with unnamed relation? Can you give one input and required output example?

@pklpriv (Member Author) commented Jun 13, 2024:

@elronbandel

I will allow myself to quote Yoav:

"As of December 31, 2011, we had 4,260 employees of whom 2,155, or 51%, are employed in the U.S.; 1,165, or 27%, are employed in Europe; 615, or 14%, are employed in Asia and 325, or 8%, are employed in the Middle East."

This is the expected output:

[
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "2155", "EMPLOYEE_PERCENT": "51%", "GEOGRAPHY": "U.S."},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "1165", "EMPLOYEE_PERCENT": "27%", "GEOGRAPHY": "Europe"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "615", "EMPLOYEE_PERCENT": "14%", "GEOGRAPHY": "Asia"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "325", "EMPLOYEE_PERCENT": "8%", "GEOGRAPHY": "Middle East"}
]

As you can see, there is no object that represents the relation itself (as opposed to, e.g., (John, employedBy, Hannah), where 'employedBy' is the relation stated as an object). Instead, the relation is defined by the sentence itself.
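One way to handle such data (my illustration only, not something settled in this thread) is to flatten each n-ary tuple into binary pairs, so that co-occurrence in the same tuple acts as the relation:

```python
from itertools import combinations

# Sketch: flatten an n-ary tuple with no named relation into binary
# "slot: value" pairs; being in the same tuple is the relation itself.
def flatten_tuple(record: dict) -> list:
    items = [f"{slot}: {value}" for slot, value in record.items()]
    return list(combinations(items, 2))

record = {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"}
print(flatten_tuple(record))
# [('DATE: December 31, 2011', 'EMPLOYEE_NUMBER: 4,260')]
```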
