Prep for the new release #248

Merged
merged 11 commits into from
Jun 26, 2024
52 changes: 22 additions & 30 deletions README.md
@@ -35,7 +35,7 @@ Deploy a pre-trained embedding model without writing a single line of code.
### Installation from Source
``` bash
git clone https://github.com/dice-group/dice-embeddings.git
conda create -n dice python=3.10.13 --no-default-packages && conda activate dice && cd dice-embeddings &&
conda create -n dice python=3.10.13 --no-default-packages && conda activate dice
pip3 install -e .
```
or
@@ -48,7 +48,7 @@ wget https://files.dice-research.org/datasets/dice-embeddings/KGs.zip --no-check
```
To test the Installation
```bash
python -m pytest -p no:warnings -x # Runs >114 tests leading to > 15 mins
python -m pytest -p no:warnings -x # Runs >119 tests leading to > 15 mins
python -m pytest -p no:warnings --lf # run only the last failed test
python -m pytest -p no:warnings --ff # to run the failures first and then the rest of the tests.
```
@@ -95,45 +95,26 @@ A KGE model can also be trained from the command line
```bash
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
dicee automaticaly detects available GPUs and trains a model with distributed data parallels technique. Under the hood, dicee uses lighning as a default trainer.
dicee automatically detects available GPUs and trains a model with the distributed data parallel technique.
```bash
# Train a model by only using the GPU-0
CUDA_VISIBLE_DEVICES=0 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by only using GPU-1
CUDA_VISIBLE_DEVICES=1 dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Train a model by using all available GPUs
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
```
Under the hood, dicee executes run.py script and uses lighning as a default trainer
Under the hood, dicee executes the run.py script and uses [lightning](https://lightning.ai/) as a default trainer.
```bash
# Two equivalent executions
# (1)
dicee --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set: Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Validation set: Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072362996241839}
# Evaluate Keci on Test set: Evaluate Keci on Test set
# {'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}

# (2)
CUDA_VISIBLE_DEVICES=0,1 python dicee/scripts/run.py --trainer PL --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set: Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Train set: Evaluate Keci on Train set
# Evaluate Keci on Validation set: Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072362996241839}
# Evaluate Keci on Test set: Evaluate Keci on Test set
# {'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}
```
Similarly, models can be easily trained with torchrun
```bash
torchrun --standalone --nnodes=1 --nproc_per_node=gpu dicee/scripts/run.py --trainer torchDDP --dataset_dir "KGs/UMLS" --model Keci --eval_model "train_val_test"
# Evaluate Keci on Train set: Evaluate Keci on Train set: Evaluate Keci on Train set
# {'H@1': 0.9518788343558282, 'H@3': 0.9988496932515337, 'H@10': 1.0, 'MRR': 0.9753123402351737}
# Evaluate Keci on Validation set: Evaluate Keci on Validation set
# {'H@1': 0.6932515337423313, 'H@3': 0.9041411042944786, 'H@10': 0.9754601226993865, 'MRR': 0.8072499937521418}
# Evaluate Keci on Test set: Evaluate Keci on Test set
{'H@1': 0.6951588502269289, 'H@3': 0.9039334341906202, 'H@10': 0.9750378214826021, 'MRR': 0.8064032293278861}
```
You can also train a model in a multi-node, multi-GPU setting.
```bash
@@ -143,7 +124,7 @@ torchrun --nnodes 2 --nproc_per_node=gpu --node_rank 1 --rdzv_id 455 --rdzv_bac
Train a KGE model by providing the path of a single file and store all parameters under a newly created directory
called `KeciFamilyRun`.
```bash
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib
dicee --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```
where the data is in the following form
```bash
@@ -152,6 +133,11 @@ _:1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07
<http://www.benchmark.org/family#hasChild> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
<http://www.benchmark.org/family#hasParent> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#ObjectProperty> .
```
**Continual Training:** the training phase of a pretrained model can be resumed.
```bash
dicee --continual_learning KeciFamilyRun --path_single_kg "KGs/Family/family-benchmark_rich_background.owl" --model Keci --path_to_store_single_run KeciFamilyRun --backend rdflib --eval_model None
```

**Apart from n-triples or standard link prediction dataset formats, we support ["owl", "nt", "turtle", "rdf/xml", "n3"]**.
Moreover, a KGE model can be also trained by providing **an endpoint of a triple store**.
```bash
@@ -285,16 +271,22 @@ pre_trained_kge.predict_topk(r=[".."],t=[".."],topk=10)

## Downloading Pretrained Models

We provide plenty pretrained knowledge graph embedding models at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/).
<details> <summary> To see a code snippet </summary>

```python
from dicee import KGE
# (1) Load a pretrained ConEx on DBpedia
model = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/KINSHIP-Keci-dim128-epoch256-KvsAll")
mure = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_MuRE-dim128-epoch256-KvsAll")
quate = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Pykeen_QuatE-dim128-epoch256-KvsAll")
keci = KGE(url="https://files.dice-research.org/projects/DiceEmbeddings/YAGO3-10-Keci-dim128-epoch256-KvsAll")
quate.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9894362688064575), ('Europe', 0.01575559377670288), ('Tadanari_Lee', 0.012544365599751472)]
keci.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.6522021293640137), ('Chinggis_Khaan_International_Airport', 0.36563414335250854), ('Democratic_Party_(Mongolia)', 0.19600993394851685)]
mure.predict_topk(h=["Mongolia"],r=["isLocatedIn"],topk=3)
# [('Asia', 0.9996906518936157), ('Ulan_Bator', 0.0009907372295856476), ('Philippines', 0.0003116439620498568)]
```

- For more please look at [dice-research.org/projects/DiceEmbeddings/](https://files.dice-research.org/projects/DiceEmbeddings/)

</details>

## How to Deploy
2 changes: 2 additions & 0 deletions dicee/config.py
@@ -133,6 +133,8 @@ def __init__(self, **kwargs):
self.block_size: int = None
"block size of LLM"

self.continual_learning=None
"Path of a pretrained model whose training is to be resumed"

def __iter__(self):
# Iterate
2 changes: 1 addition & 1 deletion dicee/evaluator.py
@@ -456,7 +456,7 @@ def dummy_eval(self, trained_model, form_of_labelling: str):
valid_set=valid_set,
test_set=test_set,
trained_model=trained_model)
elif self.args.scoring_technique in ['KvsAll', 'KvsSample', '1vsAll', 'PvsAll', 'CCvsAll']:
elif self.args.scoring_technique in ["AllvsAll",'KvsAll', 'KvsSample', '1vsAll']:
self.eval_with_vs_all(train_set=train_set,
valid_set=valid_set,
test_set=test_set,
31 changes: 16 additions & 15 deletions dicee/executer.py
@@ -234,31 +234,32 @@ class ContinuousExecute(Execute):
(1) Loading & Preprocessing & Serializing input data.
(2) Training & Validation & Testing
(3) Storing all necessary info

During continual learning, only the *** num_epochs *** parameter can be modified.
The trained model is stored in the same folder as the seed model used for training.
The trained model is marked with the current time.
"""

def __init__(self, args):
assert os.path.exists(args.path_experiment_folder)
assert os.path.isfile(args.path_experiment_folder + '/configuration.json')
# (1) Load Previous input configuration
previous_args = load_json(args.path_experiment_folder + '/configuration.json')
dargs = vars(args)
del args
for k in list(dargs.keys()):
if dargs[k] is None:
del dargs[k]
# (2) Update (1) with new input
previous_args.update(dargs)
# (1) Current input configuration.
assert os.path.exists(args.continual_learning)
assert os.path.isfile(args.continual_learning + '/configuration.json')
# (2) Load previous input configuration.
previous_args = load_json(args.continual_learning + '/configuration.json')
args=vars(args)
#
previous_args["num_epochs"]=args["num_epochs"]
previous_args["continual_learning"]=args["continual_learning"]
print("Updated configuration:",previous_args)
try:
report = load_json(dargs['path_experiment_folder'] + '/report.json')
report = load_json(args['continual_learning'] + '/report.json')
previous_args['num_entities'] = report['num_entities']
previous_args['num_relations'] = report['num_relations']
except AssertionError:
print("Couldn't find report.json.")
previous_args = SimpleNamespace(**previous_args)
previous_args.full_storage_path = previous_args.path_experiment_folder
print('ContinuousExecute starting...')
print(previous_args)
# TODO: can we remove continuous_training from Execute ?
super().__init__(previous_args, continuous_training=True)
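Review note: the merge logic in this hunk can be read as "reload the stored configuration, then override only `num_epochs` and `continual_learning`". The following is a minimal standalone sketch of that behaviour, not the dicee implementation. The helper name `merge_continual_config` and the `lr` key in the example are hypothetical, and treating a missing `report.json` as optional is an assumption.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def merge_continual_config(new_args: dict) -> SimpleNamespace:
    """Hypothetical helper: reload a stored run configuration and override two keys."""
    run_dir = new_args["continual_learning"]
    config_path = os.path.join(run_dir, "configuration.json")
    if not os.path.isfile(config_path):
        raise FileNotFoundError(config_path)
    with open(config_path) as f:
        previous = json.load(f)
    # Only these two values are taken from the new invocation.
    previous["num_epochs"] = new_args["num_epochs"]
    previous["continual_learning"] = run_dir
    # Assumption: report.json is optional and only refines the entity/relation counts.
    report_path = os.path.join(run_dir, "report.json")
    if os.path.isfile(report_path):
        with open(report_path) as f:
            report = json.load(f)
        previous["num_entities"] = report["num_entities"]
        previous["num_relations"] = report["num_relations"]
    return SimpleNamespace(**previous)


# Usage with a throwaway directory standing in for a stored run.
with tempfile.TemporaryDirectory() as run_dir:
    with open(os.path.join(run_dir, "configuration.json"), "w") as f:
        json.dump({"model": "Keci", "num_epochs": 10, "lr": 0.1}, f)
    merged = merge_continual_config(
        {"continual_learning": run_dir, "num_epochs": 5, "lr": 0.5}
    )
    print(merged.model, merged.num_epochs, merged.lr)  # Keci 5 0.1
```

Note that the new `lr` value is ignored in this sketch, which matches the docstring's statement that only the number of epochs can be changed when resuming.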

def continual_start(self) -> dict:
@@ -279,7 +280,7 @@ def continual_start(self) -> dict:
"""
# (1)
self.trainer = DICE_Trainer(args=self.args, is_continual_training=True,
storage_path=self.args.path_experiment_folder)
storage_path=self.args.continual_learning)
# (2)
self.trained_model, form_of_labelling = self.trainer.continual_start()

1 change: 1 addition & 0 deletions dicee/models/__init__.py
@@ -6,3 +6,4 @@
from .clifford import Keci, KeciBase, CMult, DeCaL # noqa
from .pykeen_models import * # noqa
from .function_space import * # noqa
from .dualE import DualE  # noqa
2 changes: 2 additions & 0 deletions dicee/models/base_model.py
@@ -431,6 +431,8 @@ class IdentityClass(torch.nn.Module):
def __init__(self, args=None):
super().__init__()
self.args = args
def __call__(self, x):
return x

@staticmethod
def forward(x):
88 changes: 60 additions & 28 deletions dicee/models/clifford.py
@@ -764,7 +764,7 @@ def forward_triples(self, x: torch.Tensor) -> torch.FloatTensor:

Parameter
---------
x: torch.LongTensor with (n,3) shape
x: torch.LongTensor with (n, ) shape

Returns
-------
@@ -844,9 +844,9 @@ def forward_triples(self, x: torch.Tensor) -> torch.FloatTensor:
sigma_qr = 0
return h0r0t0 + score_p + score_q + score_r + sigma_pp + sigma_qq + sigma_rr + sigma_pq + sigma_qr + sigma_pr

def cl_pqr(self, a):
def cl_pqr(self, a: torch.Tensor) -> torch.Tensor:

''' Input: tensor(batch_size, emb_dim) ----> output: tensor with 1+p+q+r components with size (batch_size, emb_dim/(1+p+q+r)) each.
''' Input: tensor(batch_size, emb_dim) ---> output: tensor with 1+p+q+r components with size (batch_size, emb_dim/(1+p+q+r)) each.

1) takes a tensor of size (batch_size, emb_dim), split it into 1 + p + q +r components, hence 1+p+q+r must be a divisor
of the emb_dim.
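Review note: step 1 of this docstring (splitting an embedding into 1 + p + q + r equally sized components) can be illustrated without torch. The sketch below works on a plain Python list; the function name `split_components` is hypothetical, and contiguous chunking is an assumption about the layout rather than a statement of how `cl_pqr` arranges its output.

```python
def split_components(vec, p, q, r):
    """Split a flat vector into 1 scalar block, p, q and r basis blocks of equal size."""
    n = 1 + p + q + r
    if len(vec) % n != 0:
        # Mirrors the docstring's requirement that 1+p+q+r divides emb_dim.
        raise ValueError("1 + p + q + r must divide the embedding dimension")
    size = len(vec) // n
    chunks = [vec[k * size:(k + 1) * size] for k in range(n)]
    scalar = chunks[0]
    p_part = chunks[1:1 + p]
    q_part = chunks[1 + p:1 + p + q]
    r_part = chunks[1 + p + q:]
    return scalar, p_part, q_part, r_part


scalar, p_part, q_part, r_part = split_components(list(range(8)), p=1, q=1, r=1)
print(scalar, p_part, q_part, r_part)  # [0, 1] [[2, 3]] [[4, 5]] [[6, 7]]
```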
@@ -861,17 +861,25 @@ def cl_pqr(self, a):
def compute_sigmas_single(self, list_h_emb, list_r_emb, list_t_emb):

'''here we compute all the sums with no others vectors interaction taken with the scalar product with t, that is,
1) s0 = h_0r_0t_0
2) s1 = \sum_{i=1}^{p}h_ir_it_0
3) s2 = \sum_{j=p+1}^{p+q}h_jr_jt_0
4) s3 = \sum_{i=1}^{q}(h_0r_it_i + h_ir_0t_i)
5) s4 = \sum_{i=p+1}^{p+q}(h_0r_it_i + h_ir_0t_i)
5) s5 = \sum_{i=p+q+1}^{p+q+r}(h_0r_it_i + h_ir_0t_i)

.. math::

s0 = h_0r_0t_0
s1 = \sum_{i=1}^{p}h_ir_it_0
s2 = \sum_{j=p+1}^{p+q}h_jr_jt_0
s3 = \sum_{i=1}^{p}(h_0r_it_i + h_ir_0t_i)
s4 = \sum_{i=p+1}^{p+q}(h_0r_it_i + h_ir_0t_i)
s5 = \sum_{i=p+q+1}^{p+q+r}(h_0r_it_i + h_ir_0t_i)

and return:

*) sigma_0t = \sigma_0 \cdot t_0 = s0 + s1 -s2
*) s3, s4 and s5'''
.. math::

sigma_0t = \sigma_0 \cdot t_0 = s0 + s1 -s2
s3, s4 and s5


'''

p = self.p
q = self.q
@@ -906,15 +914,19 @@ def compute_sigmas_multivect(self, list_h_emb, list_r_emb):

For same bases vectors interaction we have

1) \sigma_pp = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(h_ir_{i'}-h_{i'}r_i) (models the interactions between e_i and e_i' for 1 <= i, i' <= p)
2) \sigma_qq = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(h_jr_{j'}-h_{j'} (models the interactions between e_j and e_j' for p+1 <= j, j' <= p+q)
3) \sigma_rr = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p}(h_kr_{k'}-h_{k'}r_k) (models the interactions between e_k and e_k' for p+q+1 <= k, k' <= p+q+r)

.. math::

\sigma_{pp} = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(h_ir_{i'}-h_{i'}r_i) (models the interactions between e_i and e_i' for 1 <= i, i' <= p)
\sigma_{qq} = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(h_jr_{j'}-h_{j'}r_j) (models the interactions between e_j and e_j' for p+1 <= j, j' <= p+q)
\sigma_{rr} = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p+q+r}(h_kr_{k'}-h_{k'}r_k) (models the interactions between e_k and e_k' for p+q+1 <= k, k' <= p+q+r)

For different base vector interactions, we have

4) \sigma_pq = \sum_{i=1}^{p}\sum_{j=p+1}^{p+q}(h_ir_j - h_jr_i) (interactionsn between e_i and e_j for 1<=i <=p and p+1<= j <= p+q)
5) \sigma_pr = \sum_{i=1}^{p}\sum_{k=p+q+1}^{p+q+r}(h_ir_k - h_kr_i) (interactionsn between e_i and e_k for 1<=i <=p and p+q+1<= k <= p+q+r)
6) \sigma_qr = \sum_{j=p+1}^{p+q}\sum_{j=p+q+1}^{p+q+r}(h_jr_k - h_kr_j) (interactionsn between e_j and e_k for p+1 <= j <=p+q and p+q+1<= j <= p+q+r)
.. math::

\sigma_{pq} = \sum_{i=1}^{p}\sum_{j=p+1}^{p+q}(h_ir_j - h_jr_i) (interactions between e_i and e_j for 1 <= i <= p and p+1 <= j <= p+q)
\sigma_{pr} = \sum_{i=1}^{p}\sum_{k=p+q+1}^{p+q+r}(h_ir_k - h_kr_i) (interactions between e_i and e_k for 1 <= i <= p and p+q+1 <= k <= p+q+r)
\sigma_{qr} = \sum_{j=p+1}^{p+q}\sum_{k=p+q+1}^{p+q+r}(h_jr_k - h_kr_j) (interactions between e_j and e_k for p+1 <= j <= p+q and p+q+1 <= k <= p+q+r)

'''
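Review note: the cross-basis terms in this docstring pair every index from one block with every index from another block. A minimal pure-Python sketch for the p and q blocks is given below. The function name `sigma_pq_terms` and the flat list layout (index 0 for the scalar part, 1..p for the p bases, p+1..p+q for the q bases) are assumptions made for illustration; the actual method operates on batched tensors.

```python
def sigma_pq_terms(h, r, p, q):
    """Return {(i, j): h_i*r_j - h_j*r_i} for 1 <= i <= p and p+1 <= j <= p+q."""
    terms = {}
    for i in range(1, p + 1):
        for j in range(p + 1, p + q + 1):
            terms[(i, j)] = h[i] * r[j] - h[j] * r[i]
    return terms


# One p basis (index 1) and two q bases (indices 2 and 3).
h = [1, 2, 3, 4]
r = [5, 6, 7, 8]
print(sigma_pq_terms(h, r, p=1, q=2))  # {(1, 2): -4, (1, 3): -8}
```

Each entry is the coefficient attached to the bivector e_i e_j; the same double loop with shifted index ranges would give the p-r and q-r terms.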

@@ -958,15 +970,15 @@ def forward_k_vs_all(self, x: torch.Tensor) -> torch.FloatTensor:
"""
Kvsall training

(1) Retrieve real-valued embedding vectors for heads and relations \mathbb{R}^d .
(2) Construct head entity and relation embeddings according to Cl_{p,q}(\mathbb{R}^d) .
(1) Retrieve real-valued embedding vectors for heads and relations
(2) Construct head entity and relation embeddings according to Cl_{p,q, r}(\mathbb{R}^d) .
(3) Perform Cl multiplication
(4) Inner product of (3) and all entity embeddings

forward_k_vs_with_explicit and this function are identical
Parameter
---------
x: torch.LongTensor with (n,2) shape
x: torch.LongTensor with (n, ) shape
Returns
-------
torch.FloatTensor with (n, |E|) shape
@@ -1097,9 +1109,12 @@ def construct_cl_multivector(self, x: torch.FloatTensor, re: int, p: int, q: int

def compute_sigma_pp(self, hp, rp):
"""
\sigma_{p,p}^* = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(x_iy_{i'}-x_{i'}y_i)
Compute
.. math::

\sigma_{p,p}^* = \sum_{i=1}^{p-1}\sum_{i'=i+1}^{p}(x_iy_{i'}-x_{i'}y_i)

sigma_{pp} captures the interactions between along p bases
\sigma_{pp} captures the interactions among the p bases.
For instance, if the p bases are e_1, e_2, e_3, we compute the interactions between e_1 e_2, e_1 e_3, and e_2 e_3.
This can be implemented with two nested for loops.
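Review note: the nested-loop formulation mentioned in this docstring can be sketched in pure Python as below, together with an equivalent formulation over index pairs. The function names are hypothetical and the inputs are plain lists holding only the p components (0-based), whereas the actual method works on batched tensors and upper-triangle indices.

```python
from itertools import combinations


def sigma_pp_nested(x, y):
    """Per-pair terms x_i*y_j - x_j*y_i for all i < j, via two nested loops."""
    p = len(x)
    terms = []
    for i in range(p - 1):
        for j in range(i + 1, p):
            terms.append(x[i] * y[j] - x[j] * y[i])
    return terms


def sigma_pp_pairs(x, y):
    """Same terms, enumerating the upper-triangle index pairs directly."""
    return [x[i] * y[j] - x[j] * y[i] for i, j in combinations(range(len(x)), 2)]


x, y = [1, 2, 3], [4, 5, 6]
print(sigma_pp_nested(x, y))  # [-3, -6, -3], for the pairs (e_1,e_2), (e_1,e_3), (e_2,e_3)
print(sigma_pp_pairs(x, y) == sigma_pp_nested(x, y))  # True
```

Enumerating the index pairs once, as in the second function, is the list analogue of precomputing upper-triangle indices and avoids the explicit double loop.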

@@ -1125,7 +1140,12 @@ def compute_sigma_pp(self, hp, rp):

def compute_sigma_qq(self, hq, rq):
"""
Compute \sigma_{q,q}^* = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(x_jy_{j'}-x_{j'}y_j) Eq. 16
Compute

.. math::

\sigma_{q,q}^* = \sum_{j=p+1}^{p+q-1}\sum_{j'=j+1}^{p+q}(x_jy_{j'}-x_{j'}y_j) Eq. 16

\sigma_{qq} captures the interactions among the q bases.
For instance, if the q bases are e_1, e_2, e_3, we compute the interactions between e_1 e_2, e_1 e_3, and e_2 e_3.
This can be implemented with two nested for loops.
@@ -1157,7 +1177,9 @@ def compute_sigma_qq(self, hq, rq):

def compute_sigma_rr(self, hk, rk):
"""
\sigma_{r,r}^* = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p}(x_ky_{k'}-x_{k'}y_k)
.. math::

\sigma_{r,r}^* = \sum_{k=p+q+1}^{p+q+r-1}\sum_{k'=k+1}^{p+q+r}(x_ky_{k'}-x_{k'}y_k)

"""
# Compute indexes for the upper triangle of p by p matrix
@@ -1173,7 +1195,11 @@ def compute_sigma_rr(self, hk, rk):

def compute_sigma_pq(self, *, hp, hq, rp, rq):
"""
\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
Compute

.. math::

\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j

results = []
sigma_pq = torch.zeros(b, r, p, q)
@@ -1189,7 +1215,11 @@ def compute_sigma_pq(self, *, hp, hq, rp, rq):

def compute_sigma_pr(self, *, hp, hk, rp, rk):
"""
\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
Compute

.. math::

\sum_{i=1}^{p} \sum_{k=p+q+1}^{p+q+r} (h_i r_k - h_k r_i) e_i e_k

results = []
sigma_pq = torch.zeros(b, r, p, q)
@@ -1205,7 +1235,9 @@ def compute_sigma_pr(self, *, hp, hk, rp, rk):

def compute_sigma_qr(self, *, hq, hk, rq, rk):
"""
\sum_{i=1}^{p} \sum_{j=p+1}^{p+q} (h_i r_j - h_j r_i) e_i e_j
.. math::

\sum_{j=p+1}^{p+q} \sum_{k=p+q+1}^{p+q+r} (h_j r_k - h_k r_j) e_j e_k

results = []
sigma_pq = torch.zeros(b, r, p, q)