Add multi-source setting (#34)
* Everything except Movies

* Introduce new version of MGB

* Further adaptations for multi-source case

* Adapt IdMapped

* Fix cache_path type, adjust docs

* Adapt file naming and statistics

* Check windows

* Fix windows
dobraczka authored Mar 25, 2024
1 parent adde46d commit fae065e
Showing 25 changed files with 1,396 additions and 725 deletions.
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,24 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.3.0] - 2024-03-25

### Added

- MED-BBK dataset
- Dataset statistics
- Support for multi-source datasets
- Multi-source case for MovieGraphBenchmark

### Changed

- entity links are now handled via eche's (Prefixed)ClusterHelper
- Very large and very small datasets only allow dask/pandas backend respectively

### Fixed

- dask/pandas backend typing

## [0.2.1] - 2023-08-09

### Fixed
15 changes: 0 additions & 15 deletions README.md
@@ -51,21 +51,6 @@ You can get a canonical name for a dataset instance to use e.g. to create folder
'openea_d_w_15k_v1'
```

Create id-mapped dataset for embedding-based methods:

```
>>> from sylloge import IdMappedEADataset
>>> id_mapped_ds = IdMappedEADataset.from_ea_dataset(ds)
>>> id_mapped_ds
IdMappedEADataset(rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, entity_mapping=30000, rel_mapping=417, attr_rel_mapping=990, attr_mapping=138836, folds=5)
>>> id_mapped_ds.rel_triples_right
[[26048 330 16880]
[19094 293 23348]
[16554 407 29192]
...
[16480 330 15109]
[18465 254 19956]
[26040 290 28560]]
```
You can use [dask](https://www.dask.org/) as backend for larger datasets:
102 changes: 51 additions & 51 deletions dataset_statistics.csv
@@ -1,51 +1,51 @@
Dataset family,Task Name,Dataset Name,Entities,Relation Triples,Attribute Triples,Relations,Properties,Literals,Matches
OpenEA,openea_d_w_15k_v1,DBpedia,15000,38265,52134,248,341,28236,15000
OpenEA,openea_d_w_15k_v1,Wikidata,15000,42746,138246,169,649,118515,15000
OpenEA,openea_d_w_15k_v2,DBpedia,15000,73983,51378,167,174,25690,15000
OpenEA,openea_d_w_15k_v2,Wikidata,15000,83365,175686,121,457,146977,15000
OpenEA,openea_d_y_15k_v1,DBpedia,15000,30291,52093,165,256,25297,15000
OpenEA,openea_d_y_15k_v1,YAGO,15000,26638,117114,28,34,105710,15000
OpenEA,openea_d_y_15k_v2,DBpedia,15000,68063,49602,72,89,22561,15000
OpenEA,openea_d_y_15k_v2,YAGO,15000,60970,116151,21,19,104546,15000
OpenEA,openea_en_de_15k_v1,DBpedia_EN,15000,47676,62403,215,285,28973,15000
OpenEA,openea_en_de_15k_v1,DBpedia_DE,15000,50419,133776,131,193,35630,15000
OpenEA,openea_en_de_15k_v2,DBpedia_EN,15000,84867,59511,169,170,23831,15000
OpenEA,openea_en_de_15k_v2,DBpedia_DE,15000,92632,161315,96,115,33185,15000
OpenEA,openea_en_fr_15k_v1,DBpedia_EN,15000,47334,57164,267,307,30281,15000
OpenEA,openea_en_fr_15k_v1,DBpedia_FR,15000,40864,54401,210,403,28760,15000
OpenEA,openea_en_fr_15k_v2,DBpedia_EN,15000,96318,52396,193,188,22761,15000
OpenEA,openea_en_fr_15k_v2,DBpedia_FR,15000,80112,56114,166,220,21645,15000
OpenEA,openea_d_w_100k_v1,DBpedia,100000,293990,334911,413,492,133931,100000
OpenEA,openea_d_w_100k_v1,Wikidata,100000,251708,687860,261,874,542921,100000
OpenEA,openea_d_w_100k_v2,DBpedia,100000,616457,360696,318,327,137483,100000
OpenEA,openea_d_w_100k_v2,Wikidata,100000,588203,878219,239,760,682367,100000
OpenEA,openea_d_y_100k_v1,DBpedia,100000,294188,360415,287,378,101386,100000
OpenEA,openea_d_y_100k_v1,YAGO,100000,400518,649787,32,37,497633,100000
OpenEA,openea_d_y_100k_v2,DBpedia,100000,576547,374785,230,276,97433,100000
OpenEA,openea_d_y_100k_v2,YAGO,100000,865265,755161,31,35,578596,100000
OpenEA,openea_en_de_100k_v1,DBpedia_EN,100000,335359,423666,381,450,147142,100000
OpenEA,openea_en_de_100k_v1,DBpedia_DE,100000,336240,586207,196,251,199527,100000
OpenEA,openea_en_de_100k_v2,DBpedia_EN,100000,622588,430752,323,325,139867,100000
OpenEA,openea_en_de_100k_v2,DBpedia_DE,100000,629395,656458,170,188,200356,100000
OpenEA,openea_en_fr_100k_v1,DBpedia_EN,100000,309607,384248,400,465,145103,100000
OpenEA,openea_en_fr_100k_v1,DBpedia_FR,100000,258285,340725,300,518,157791,100000
OpenEA,openea_en_fr_100k_v2,DBpedia_EN,100000,649902,396150,379,363,145382,100000
OpenEA,openea_en_fr_100k_v2,DBpedia_FR,100000,561391,342768,287,467,157564,100000
MED_BBK,med_bbk,MED,9162,158357,11467,32,19,10858,9162
MED_BBK,med_bbk,BBK,9162,50307,44987,20,21,36608,9162
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,imdb,5129,17507,20800,3,13,6082,1978
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,tmdb,6061,27903,23761,4,30,9991,1978
MovieGraphBenchmark,moviegraphbenchmark_imdb_tvdb,imdb,5129,17507,20800,3,13,6082,2488
MovieGraphBenchmark,moviegraphbenchmark_imdb_tvdb,tvdb,7814,15455,20902,3,9,7683,2488
MovieGraphBenchmark,moviegraphbenchmark_tmdb_tvdb,tmdb,6061,27903,23761,4,30,9991,2483
MovieGraphBenchmark,moviegraphbenchmark_tmdb_tvdb,tvdb,7814,15455,20902,3,9,7683,2483
OAEI,oaei_starwars_swg,starwars,536869,6675247,1570786,561,603,622454,1096
OAEI,oaei_starwars_swg,swg,47692,178085,76269,50,146,32765,1096
OAEI,oaei_starwars_swtor,starwars,536869,6675247,1570786,561,603,622454,1358
OAEI,oaei_starwars_swtor,swtor,22791,105543,40605,137,346,16984,1358
OAEI,oaei_marvelcinematicuniverse_marvel,marvelcinematicuniverse,216033,1094598,130517,130,110,56566,1654
OAEI,oaei_marvelcinematicuniverse_marvel,marvel,1472619,5152898,1580468,63,127,749980,1654
OAEI,oaei_memoryalpha_memorybeta,memoryalpha,254537,2096198,430730,180,287,226110,9296
OAEI,oaei_memoryalpha_memorybeta,memorybeta,212302,2048728,494181,327,332,231196,9296
OAEI,oaei_memoryalpha_stexpanded,memoryalpha,254537,2096198,430730,180,287,226110,1725
OAEI,oaei_memoryalpha_stexpanded,stexpanded,55402,412179,155207,133,194,70310,1725
Dataset family,Task Name,Dataset Name,Entities,Relation Triples,Attribute Triples,Relations,Properties,Literals,Clusters,Intra-dataset Matches,All Matches
OpenEA,openea_d_w_15k_v1,DBpedia,15000,38265,52134,248,341,28236,15000,0,15000
OpenEA,openea_d_w_15k_v1,Wikidata,15000,42746,138246,169,649,118515,15000,0,15000
OpenEA,openea_d_w_15k_v2,DBpedia,15000,73983,51378,167,174,25690,15000,0,15000
OpenEA,openea_d_w_15k_v2,Wikidata,15000,83365,175686,121,457,146977,15000,0,15000
OpenEA,openea_d_y_15k_v1,DBpedia,15000,30291,52093,165,256,25297,15000,0,15000
OpenEA,openea_d_y_15k_v1,YAGO,15000,26638,117114,28,34,105710,15000,0,15000
OpenEA,openea_d_y_15k_v2,DBpedia,15000,68063,49602,72,89,22560,15000,0,15000
OpenEA,openea_d_y_15k_v2,YAGO,15000,60970,116151,21,19,104546,15000,0,15000
OpenEA,openea_en_de_15k_v1,DBpedia_EN,15000,47676,62403,215,285,28972,15000,0,15000
OpenEA,openea_en_de_15k_v1,DBpedia_DE,15000,50419,133776,131,193,35630,15000,0,15000
OpenEA,openea_en_de_15k_v2,DBpedia_EN,15000,84867,59511,169,170,23830,15000,0,15000
OpenEA,openea_en_de_15k_v2,DBpedia_DE,15000,92632,161315,96,115,33185,15000,0,15000
OpenEA,openea_en_fr_15k_v1,DBpedia_EN,15000,47334,57164,267,307,30281,15000,0,15000
OpenEA,openea_en_fr_15k_v1,DBpedia_FR,15000,40864,54401,210,403,28760,15000,0,15000
OpenEA,openea_en_fr_15k_v2,DBpedia_EN,15000,96318,52396,193,188,22761,15000,0,15000
OpenEA,openea_en_fr_15k_v2,DBpedia_FR,15000,80112,56114,166,220,21645,15000,0,15000
OpenEA,openea_d_w_100k_v1,DBpedia,100000,293990,334911,413,492,133930,100000,0,100000
OpenEA,openea_d_w_100k_v1,Wikidata,100000,251708,687860,261,874,542921,100000,0,100000
OpenEA,openea_d_w_100k_v2,DBpedia,100000,616457,360696,318,327,137482,100000,0,100000
OpenEA,openea_d_w_100k_v2,Wikidata,100000,588203,878219,239,760,682367,100000,0,100000
OpenEA,openea_d_y_100k_v1,DBpedia,100000,294188,360415,287,378,101385,100000,0,100000
OpenEA,openea_d_y_100k_v1,YAGO,100000,400518,649787,32,37,497633,100000,0,100000
OpenEA,openea_d_y_100k_v2,DBpedia,100000,576547,374785,230,276,97432,100000,0,100000
OpenEA,openea_d_y_100k_v2,YAGO,100000,865265,755161,31,35,578595,100000,0,100000
OpenEA,openea_en_de_100k_v1,DBpedia_EN,100000,335359,423666,381,450,147141,100000,0,100000
OpenEA,openea_en_de_100k_v1,DBpedia_DE,100000,336240,586207,196,251,199527,100000,0,100000
OpenEA,openea_en_de_100k_v2,DBpedia_EN,100000,622588,430752,323,325,139866,100000,0,100000
OpenEA,openea_en_de_100k_v2,DBpedia_DE,100000,629395,656458,170,188,200356,100000,0,100000
OpenEA,openea_en_fr_100k_v1,DBpedia_EN,100000,309607,384248,400,465,145102,100000,0,100000
OpenEA,openea_en_fr_100k_v1,DBpedia_FR,100000,258285,340725,300,518,157791,100000,0,100000
OpenEA,openea_en_fr_100k_v2,DBpedia_EN,100000,649902,396150,379,363,145381,100000,0,100000
OpenEA,openea_en_fr_100k_v2,DBpedia_FR,100000,561391,342768,287,467,157564,100000,0,100000
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,imdb,5129,17507,20800,3,13,6082,2201,1,2237
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,tmdb,6061,27903,23761,4,30,9991,2201,64,2237
MovieGraphBenchmark,moviegraphbenchmark_imdb_tvdb,imdb,5129,17507,20800,3,13,6082,1483,1,25583
MovieGraphBenchmark,moviegraphbenchmark_imdb_tvdb,tvdb,7814,15455,20902,3,9,7683,1483,22663,25583
MovieGraphBenchmark,moviegraphbenchmark_tmdb_tvdb,tmdb,6061,27903,23761,4,30,9991,1920,64,26138
MovieGraphBenchmark,moviegraphbenchmark_tmdb_tvdb,tvdb,7814,15455,20902,3,9,7683,1920,22663,26138
MED_BBK,med_bbk,MED,9162,158357,11467,32,19,10858,8885,0,5619
MED_BBK,med_bbk,BBK,9162,50307,44987,20,21,36608,8885,0,5619
OAEI,oaei_marvelcinematicuniverse_marvel,marvelcinematicuniverse,216033,1094598,130517,130,110,56566,1654,0,1654
OAEI,oaei_marvelcinematicuniverse_marvel,marvel,1472619,5152898,1580468,63,127,749980,1654,0,1654
OAEI,oaei_memoryalpha_memorybeta,memoryalpha,254537,2096198,430730,180,287,226110,9296,0,9296
OAEI,oaei_memoryalpha_memorybeta,memorybeta,212302,2048728,494181,327,332,231196,9296,0,9296
OAEI,oaei_memoryalpha_stexpanded,memoryalpha,254537,2096198,430730,180,287,226110,1725,0,1725
OAEI,oaei_memoryalpha_stexpanded,stexpanded,55402,412179,155207,133,194,70310,1725,0,1725
OAEI,oaei_starwars_swg,starwars,536869,6675247,1570786,561,603,622454,1096,0,1096
OAEI,oaei_starwars_swg,swg,47692,178085,76269,50,146,32765,1096,0,1096
OAEI,oaei_starwars_swtor,starwars,536869,6675247,1570786,561,603,622454,1358,0,1358
OAEI,oaei_starwars_swtor,swtor,22791,105543,40605,137,346,16984,1358,0,1358
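The new columns can be checked with a few lines of Python. The sketch below is not part of the commit: it embeds the header and four rows copied from the updated `dataset_statistics.csv` above and uses only the standard library to list the tasks in which any source reports intra-dataset matches. The helper name `tasks_with_intra_matches` is hypothetical.

```python
import csv
import io

# Header and four rows copied verbatim from the updated dataset_statistics.csv.
SAMPLE = """\
Dataset family,Task Name,Dataset Name,Entities,Relation Triples,Attribute Triples,Relations,Properties,Literals,Clusters,Intra-dataset Matches,All Matches
OpenEA,openea_d_w_15k_v1,DBpedia,15000,38265,52134,248,341,28236,15000,0,15000
OpenEA,openea_d_w_15k_v1,Wikidata,15000,42746,138246,169,649,118515,15000,0,15000
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,imdb,5129,17507,20800,3,13,6082,2201,1,2237
MovieGraphBenchmark,moviegraphbenchmark_imdb_tmdb,tmdb,6061,27903,23761,4,30,9991,2201,64,2237
"""


def tasks_with_intra_matches(csv_text):
    """Hypothetical helper: task names where a source has intra-dataset matches."""
    tasks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        has_intra = int(row["Intra-dataset Matches"]) > 0
        # Keep first-seen order and avoid listing a task once per source.
        if has_intra and row["Task Name"] not in tasks:
            tasks.append(row["Task Name"])
    return tasks


print(tasks_with_intra_matches(SAMPLE))
```

In this excerpt only the MovieGraphBenchmark task has non-zero intra-dataset matches; the same function can be pointed at the full file contents instead of `SAMPLE`.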
17 changes: 0 additions & 17 deletions docs/index.rst
@@ -38,23 +38,6 @@ You can get a canonical name for a dataset instance to use e.g. to create folder
    print(ds.canonical_name)
    # 'openea_d_w_15k_v1'

Create id-mapped dataset for embedding-based methods:

.. code-block:: python

    from sylloge import IdMappedEADataset
    id_mapped_ds = IdMappedEADataset.from_ea_dataset(ds)
    print(id_mapped_ds)
    # IdMappedEADataset(rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, entity_mapping=30000, rel_mapping=417, attr_rel_mapping=990, attr_mapping=138836, folds=5)
    print(id_mapped_ds.rel_triples_right)
    # [[26048 330 16880]
    # [19094 293 23348]
    # [16554 407 29192]
    # ...
    # [16480 330 15109]
    # [18465 254 19956]
    # [26040 290 28560]]

You can use `dask <https://www.dask.org/>`_ as backend for larger datasets:

9 changes: 8 additions & 1 deletion docs/source/apidoc.rst
@@ -12,6 +12,7 @@ Datasets
sylloge.MovieGraphBenchmark
sylloge.OpenEA
sylloge.OAEI
sylloge.MED_BBK


Base
@@ -20,10 +21,16 @@ Base
:nosignatures:

sylloge.base.TrainTestValSplit
sylloge.base.EADataset
sylloge.base.MultiSourceEADataset
sylloge.base.ParquetEADataset
sylloge.base.CacheableEADataset
sylloge.base.ZipEADataset
sylloge.base.ZipEADatasetWithPreSplitFolds
sylloge.base.BinaryEADataset
sylloge.base.BinaryParquetEADataset
sylloge.base.BinaryCacheableEADataset
sylloge.base.BinaryZipEADataset
sylloge.base.BinaryZipEADatasetWithPreSplitFolds


IdMapped
48 changes: 43 additions & 5 deletions poetry.lock


3 changes: 2 additions & 1 deletion pyproject.toml
@@ -19,12 +19,13 @@ Sphinx = {version = "^5.0.0", optional = true}
insegel = {version = "^1.3.1", optional = true}
pystow = "^0.4.6"
pandas = ">=1.0"
moviegraphbenchmark = "^1.0.1"
sphinx-automodapi = {version = "^0.14.1", optional = true}
sphinx-autodoc-typehints = {version = "^1.19.2", optional = true}
python-slugify = ">=7.0.0"
dask = ">=2022.01.0"
pyarrow = "*"
moviegraphbenchmark = "^1.1.0"
eche = "^0.2.1"


[tool.poetry.group.dev.dependencies]
23 changes: 21 additions & 2 deletions sylloge/__init__.py
@@ -1,7 +1,16 @@
import logging
from importlib.metadata import version # pragma: no cover

from .base import EADataset
from .base import (
    BinaryEADataset,
    BinaryParquetEADataset,
    CacheableEADataset,
    MultiSourceEADataset,
    ParquetEADataset,
    TrainTestValSplit,
    ZipEADataset,
    ZipEADatasetWithPreSplitFolds,
)
from .id_mapped import IdMappedEADataset
from .med_bbk_loader import MED_BBK
from .moviegraph_benchmark_loader import MovieGraphBenchmark
@@ -14,7 +23,17 @@
    "OAEI",
    "MED_BBK",
    "IdMappedEADataset",
    "EADataset",
    "MultiSourceEADataset",
    "BinaryEADataset",
    "BinaryParquetEADataset",
    "ParquetEADataset",
    "BinaryCacheableEADataset",
    "CacheableEADataset",
    "BinaryZipEADataset",
    "ZipEADataset",
    "BinaryZipEADatasetWithPreSplitFolds",
    "ZipEADatasetWithPreSplitFolds",
    "TrainTestValSplit",
]
__version__ = version(__package__)
logging.getLogger(__name__).setLevel(logging.INFO)
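
The changelog in this commit states that entity links are now handled as clusters rather than as left/right pairs. As a rough illustration of that idea only, the sketch below groups pairwise links into clusters with a small union-find and counts the pairwise matches each cluster implies (a cluster of size k implies k(k-1)/2 pairs). This is plain Python, not eche's `ClusterHelper` API; the function names and the prefixed identifiers are made up for the example, and it is not claimed to reproduce how the statistics columns are computed.

```python
from collections import defaultdict


def clusters_from_links(links):
    """Hypothetical helper: group pairwise entity links into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for entity in list(parent):
        groups[find(entity)].add(entity)
    return list(groups.values())


def implied_pairs(clusters):
    """Number of pairwise matches implied by the clusters: k * (k - 1) / 2 each."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


# Made-up identifiers; the prefixes only mimic a three-source setting.
links = [
    ("imdb:a", "tmdb:a"),
    ("tmdb:a", "tvdb:a"),
    ("imdb:b", "tvdb:b"),
]
clusters = clusters_from_links(links)
print(sorted(sorted(c) for c in clusters))
print(implied_pairs(clusters))
```

With more than two sources, one cluster can span all of them, which is why a pair-based representation stops being sufficient.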