Skip to content

Latest commit

 

History

History
123 lines (98 loc) · 5.53 KB

README.md

File metadata and controls

123 lines (98 loc) · 5.53 KB

WikiPeople and its derivatives: N-ary relational datasets about people derived from Wikidata

WikiPeople was constructed as follows:

  • The Wikidata dump was downloaded and the facts concerning entities of type human were extracted.
  • Then, these facts were denoised. For example, facts containing element related to image were filtered out, and facts containing element in {unknown value, no values} were removed.
  • Subsequently, the subsets of elements which have at least 30 mentions were selected. And the facts related to these elements were kept. Further, each fact was parsed into a set of its role-value pairs.
  • The remaining facts were randomly split into training set, validation set and test set by a percentage of 80%:10%:10%.

The statistics of WikiPeople are displayed as follows, where #Train, #Valid and #Test are the sizes of the training set, the validation set and the test set, respectively.

Binary N-ary Overall
#Train 270,179 35,546 305,725
#Valid 33,845 4,378 38,223
#Test 33,890 4,391 38,281

The training facts, validation facts and test facts stored in the files n-ary_train.json, n-ary_valid.json and n-ary_test.json, respectively, are in the same format. Each line therein is a set of ("role id": "value id/list of value ids") and the arity information in form of ("N": arity), corresponding to a fact in Wikidata about a certain person. Note that all the ids, except the ones that end with "_h" or "_t", are directly adopted from Wikidata. The two types of ids ending with "_h" or "_t" are defined by WikiPeople.

A fact example of WikiPeople

Take Line 37714 in n-ary_valid.json for example:

{
  "P166_h": "Q7186", 
  "P166_t": "Q38104", 
  "N": 5, 
  "P585": ["+1903-01-01T00:00:00Z#0#0#0#9#http://www.wikidata.org/entity/Q1985727"], 
  "P1706": ["Q41269", "Q37463"]
}

This example corresponds to the fact in Wikidata: Marie Curie received Nobel Prize in Physics in 1903, together with Henri Becquerel and Pierre Curie.

The detailed description of Line 37714 is as follows (items newly defined or introdued by WikiPeople are in italics and underlined):

Item Description
P166 The id of the relation "award received"
P166_h The id of the subject role of "award received"
Q7186 The id of the value "Marie Curie"
P166_t The id of the object role of "award received"
Q38104 The id of the value "Nobel Prize in Physics"
"N":5 The arity of this fact is 5
P585 The id of the role "point in time"
+1903-01-01T00:00:00Z#0#0#0#9
#http://www.wikidata.org/entity/Q1985727
The id of the value "1903"
P1706 The id of the role "together with"
Q41269 The id of the value "Henri Becquerel"
Q37463 The id of the value "Pierre Curie"

Derive WikiPeople-n from WikiPeople

Since on WikiPeople, the percentage of n-ary relational facts is low (less than 12%), a new dataset, WikiPeople-n, was derived from WikiPeople. Detailedly, based on WikiPeople, all the n-ary relational facts were kept, and some binary relational facts were randomly removed to obtain the same percentage of binary and n-ary categories as in the training set on JF17K.

The statistics of WikiPeople-n are displayed as follows:

Binary N-ary Overall
#Train 48,851 35,546 84,397
#Valid 6,190 4,248 10,438
#Test 6,186 4,264 10,450

Note that, binary relational facts from the training set, and then the validation set and the test set, were randomly removed respectively. After removing some binary relational facts from the training set, some elements (roles/values) may only exist in the validation/test set. The facts in the validation set and the test set, which contain these elements, were removed first, before conducting random removal.

Further vary the percentage of binary relational facts in WikiPeople

We vary the percentage (0%, 50%, and 100%) of binary relational facts in WikiPeople to get three datasets, WikiPeople-0bi, WikiPeople-50bi, and WikiPeople-100bi, respectively.

The statistics of these three datasets are presented as follows:

Dataset WikiPeople-0bi WikiPeople-50bi WikiPeople-100bi
Category N-ary/Overall Binary N-ary Overall Binary/Overall
#Train 35,546 35,590 35,546 71,136 270,179
#Valid 3,912 4,234 4,218 8,452 33,649
#Test 3,930 4,253 4,228 8,481 33,694

When using the datasets, please cite:

@inproceedings{NaLP,
  title={Link prediction on n-ary relational data},
  author={Guan, Saiping and Jin, Xiaolong and Wang, Yuanzhuo and Cheng, Xueqi},
  booktitle={Proceedings of the 28th International Conference on World Wide Web (WWW'19)},
  year={2019},
  pages={583--593}
}

For any questions, please contact [email protected] or [email protected], or open an issue.