Add relation extraction #858

Draft · wants to merge 31 commits into main

Conversation

@pklpriv (Member) commented May 21, 2024:

Adding support for the relation-extraction task.

@pklpriv marked this pull request as draft on May 21, 2024, 13:27.
def get_element_representation(self, element, additional_input):
    return str(element)


def normalize_answer(s):
Member:
This is not needed: it's not called by CustomF1. It's a global-scope function used only by TokenOverlap.

@@ -1769,8 +1769,11 @@ def compute(
        references: List[List[Any]],
        predictions: List[Any],
        task_data: List[Dict],
        reference_extraction_func: Callable[
Member:

This is not needed; we always take the first reference in NER (we assume there is only one reference). In unitxt / FM-eval, a reference is a full "correct answer". Note that in this case an answer is a List[Tuple[str,str,str]].
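For concreteness, a minimal sketch of what that convention implies; the entity names here are illustrative, not from this PR:

```python
# Illustrative sketch, not code from this PR: a single reference is one full
# "correct answer", i.e. a List[Tuple[str, str, str]] of relation triples.
reference = [
    ("John", "employedBy", "Acme"),        # (subject, relation, object)
    ("Acme", "headquarteredIn", "Berlin"),
]

# references has type List[List[Any]]: one inner list per instance. Under the
# single-reference assumption, the metric only ever needs the first element.
references_for_instance = [reference]
first_and_only = references_for_instance[0]
```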

@@ -1906,6 +1909,34 @@ def lower(text):
    return white_space_fix(remove_articles(remove_punc(lower(s))))


class RelationExtraction(CustomF1):
    prediction_type = "List[Tuple[str,str]]"
Member:

Suggested change:
-    prediction_type = "List[Tuple[str,str]]"
+    prediction_type = "List[Tuple[str,str,str]]"

Member Author:

done

@yoavkatz (Member):

Hi. I added my comments. I think you should create a card that uses the task, loads the raw data from the file, and converts it to the format required by the task.
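A rough sketch of that card-level conversion step; the raw record format is an assumption, and the output field names follow the naming discussed later in this thread:

```python
# Hypothetical sketch of the card's pre-processing: convert one raw record
# (format assumed, not from this PR) into flat fields for the task.
def convert_record(raw: dict) -> dict:
    triples = raw["triples"]  # assumed: list of {"subject", "relation", "object"} dicts
    return {
        "text": raw["text"],
        "relation_types": raw.get("relation_types", []),
        "subjects": [t["subject"] for t in triples],
        "relations": [t["relation"] for t in triples],
        "objects": [t["object"] for t in triples],
    }
```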

FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "entity_surface_form1": "List[str]",
Member:

I think we need a better name; it took me a while to understand what it is. Maybe "relation_entity_subjects" / "relation_entity_objects", or something more explicit.

Member Author:

done

Przemysław Klocek added 2 commits May 23, 2024 14:12

from unitxt import add_to_catalog

sys.path.append("/Users/pklocek/fm_eval_workspace/unitxt/src/unitxt/metrics.py")
Reviewer:

I think we do not want this line?

Member:

Right. Why was this added?

@elronbandel (Member) commented Jun 10, 2024:

Since this is an important NLP task, I suggest we try to get it merged ASAP.

My suggestion is to follow the conventions and naming of the TACRED dataset, as representative of the jargon consensus.

The simple naming for the task output fields is:

subjects: List[str]
relations: List[str]
objects: List[str]

The more verbose version:

subjects_mentions: List[str]
relations_types: List[str]
objects_mentions: List[str]

I personally prefer the simple one.

Now to the second observation: there are really two types of tasks: (1) producing only mentions and relations, and (2) producing mentions with their exact locations in the text. Both have different metrics and different use cases, IMO.

So my practical suggestion here is to actually have two different tasks:

  1. tasks.relation_extraction
  2. tasks.relation_extraction.with_positions

The second, of course, should also include:

subjects_starts: List[int]
subjects_ends: List[int]
objects_starts: List[int]
objects_ends: List[int]

What do you think @pklpriv and @yoavkatz ?
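To make the proposal concrete, here is a sketch of the two task definitions, using the FormTask style already quoted in this PR. The field names follow the proposal above; the import path and metric names are placeholders, not merged code or existing catalog entries:

```python
from unitxt.blocks import FormTask

# Sketch only: tasks.relation_extraction (mentions and relations, no positions).
relation_extraction = FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "subjects": "List[str]",
        "relations": "List[str]",
        "objects": "List[str]",
    },
    metrics=["metrics.relation_extraction"],  # placeholder name
)

# Sketch only: tasks.relation_extraction.with_positions adds character offsets.
relation_extraction_with_positions = FormTask(
    inputs={"text": "str", "relation_types": "List[str]"},
    outputs={
        "subjects": "List[str]",
        "relations": "List[str]",
        "objects": "List[str]",
        "subjects_starts": "List[int]",
        "subjects_ends": "List[int]",
        "objects_starts": "List[int]",
        "objects_ends": "List[int]",
    },
    metrics=["metrics.relation_extraction.with_positions"],  # placeholder name
)
```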

@yoavkatz (Member):

I agree that the short names are better, and that there are two tasks (with and without positions). They differ not only in the input but also in the prediction type (e.g. the positions need to be checked). Initially we can support only tasks.relation_extraction. Even if the data comes with positions, it should be converted to lists of strings in the pre-processing phase of the card (and not in the template, as is done today in NER).
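A minimal sketch of that pre-processing step, assuming the raw data carries character offsets into the text; the raw format and example values are assumptions, not from this PR:

```python
# Hypothetical sketch: strip positions during card pre-processing, so the
# string-only task sees just the surface forms.
def spans_to_strings(text: str, spans: list) -> list:
    # each span is assumed to carry character offsets into `text`
    return [text[s["start"]:s["end"]] for s in spans]

text = "John works for Acme."
subjects = spans_to_strings(text, [{"start": 0, "end": 4}])   # ["John"]
objects = spans_to_strings(text, [{"start": 15, "end": 19}])  # ["Acme"]
```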

@pklpriv (Member Author) commented Jun 12, 2024:

@elronbandel @yoavkatz

I agree with you; I think this is a good way to implement it. Additionally, I am currently implementing a task/metric for datasets that do not have a specified, named relation like (obj1, relation, obj2). Instead, for a tuple (obj1, obj2, ..., obj_n), we consider membership in the same tuple to be the relation itself.

@elronbandel (Member):

@pklpriv Can you explain more about the n-ary tuple with unnamed relation? Can you give one input and required output example?

@pklpriv (Member Author) commented Jun 13, 2024:

@elronbandel

I will allow myself to quote Yoav:

"As of December 31, 2011, we had 4,260 employees of whom 2,155, or 51%, are employed in the U.S.; 1,165, or 27%, are employed in Europe; 615, or 14%, are employed in Asia and 325, or 8%, are employed in the Middle East."

This is the expected output:

[
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "2155", "EMPLOYEE_PERCENT": "51%", "GEOGRAPHY": "U.S."},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "1165", "EMPLOYEE_PERCENT": "27%", "GEOGRAPHY": "Europe"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "615", "EMPLOYEE_PERCENT": "14%", "GEOGRAPHY": "Asia"},
  {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "325", "EMPLOYEE_PERCENT": "8%", "GEOGRAPHY": "Middle East"}
]

As you can see, there is no object that represents the relation itself (as opposed to, e.g., (John, employedBy, Hannah), where 'employedBy' is the relation stated as an object). Instead, the relation is defined by the sentence itself.
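One way to handle such data (my illustration only, not something settled in this thread) is to flatten each n-ary tuple into binary pairs, so that co-occurrence in the same tuple acts as the relation:

```python
from itertools import combinations

# Sketch: flatten an n-ary tuple with no named relation into binary
# "slot: value" pairs; being in the same tuple is the relation itself.
def flatten_tuple(record: dict) -> list:
    items = [f"{slot}: {value}" for slot, value in record.items()]
    return list(combinations(items, 2))

record = {"DATE": "December 31, 2011", "EMPLOYEE_NUMBER": "4,260"}
print(flatten_tuple(record))
# [('DATE: December 31, 2011', 'EMPLOYEE_NUMBER: 4,260')]
```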
