Our datasets are generated with the following procedures.
We reuse the DIR code to generate the SPMotif datasets.
SPMotif-Struc is essentially the same as the SPMotif datasets in DIR. It can be generated by running dataset_gen/gen_struc.py with a bias configuration that sets the value of global_b:
cd dataset_gen
python gen_struc.py
The generated data will be stored in ./data/SPMotif-{global_b} at the root directory of this repo.
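As a quick sanity check after generation, the expected output directory can be derived from global_b. The sketch below is a minimal illustration and not part of the repository: the helper name expected_spmotif_dir is hypothetical, 0.33 is only an example value, and it assumes global_b appears in the directory name exactly as Python formats it.

```python
from pathlib import Path


def expected_spmotif_dir(repo_root, global_b, prefix="SPMotif"):
    """Return the directory where the generated data is expected to live.

    Hypothetical helper: assumes the name is '<prefix>-<global_b>' under ./data.
    """
    return Path(repo_root) / "data" / f"{prefix}-{global_b}"


if __name__ == "__main__":
    out_dir = expected_spmotif_dir(".", 0.33)  # 0.33 is only an example bias
    status = "found" if out_dir.is_dir() else "missing"
    print(f"{out_dir}: {status}")
```

Passing prefix="mSPMotif" gives the corresponding path for the mixed variant.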
To use the dataset in main.py, set the --dataset option to mSPMotif and the --bias option to the corresponding bias.
To generate the SPMotif-Mixed datasets, run the analogous script, again with a bias configuration that sets the value of global_b:
cd dataset_gen
python gen_mixed.py
The generated data will be stored in ./data/mSPMotif-{global_b} at the root directory of this repo.
gen_mixed.py adds the graph size shifts and structure-level shifts, while ./datasets/spmotif_dataset.py automatically adds the node feature-level shifts during data preparation.
To use the dataset in main.py, set the --dataset option to mSPMotif and the --bias option to the corresponding bias.
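The invocation described above can be sketched as a small launcher. This is a hedged illustration only: build_main_command is a hypothetical helper, the bias value 0.33 is an example, and only the --dataset and --bias options named in this README are used.

```python
import shlex
import sys


def build_main_command(dataset, bias, script="main.py"):
    """Build the argument list for launching main.py on an SPMotif variant."""
    return [sys.executable, script, "--dataset", dataset, "--bias", str(bias)]


if __name__ == "__main__":
    cmd = build_main_command("mSPMotif", 0.33)  # 0.33 is only an example bias
    # Print the command; hand cmd to subprocess.run(cmd, check=True) to launch it.
    print(shlex.join(cmd))
```

Building the command as a list avoids shell quoting issues when the same call is repeated over several bias values.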
To obtain the DrugOOD datasets tested in our paper, i.e., drugood_lbap_core_ic50_assay, drugood_lbap_core_ic50_scaffold and drugood_lbap_core_ic50_size, we use the DrugOOD curation code at commit eeb00b8da7646e1947ca7aec93041052a48bd45e together with the chembl_29 database.
After curating the datasets, put the corresponding json files under ./data/DrugOOD, and set the --dataset option to the name of the dataset to use, e.g., drugood_lbap_core_ic50_assay.
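Before launching a run, it can help to confirm that the curated files are in place. The following is a minimal sketch under a stated assumption: each curated file is named after its dataset with a .json suffix, which this README does not confirm. The helper missing_drugood_files is hypothetical.

```python
from pathlib import Path

# Dataset names as listed in this README.
DRUGOOD_DATASETS = (
    "drugood_lbap_core_ic50_assay",
    "drugood_lbap_core_ic50_scaffold",
    "drugood_lbap_core_ic50_size",
)


def missing_drugood_files(data_dir="./data/DrugOOD", names=DRUGOOD_DATASETS):
    """Return the dataset names that have no matching json file in data_dir.

    Assumption: each curated file is named '<dataset_name>.json'; adjust the
    pattern if the curation output uses a different file name.
    """
    root = Path(data_dir)
    return [name for name in names if not (root / f"{name}.json").is_file()]


if __name__ == "__main__":
    for name in missing_drugood_files():
        print(f"missing: {name}")
```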
The CMNIST dataset is generated following Invariant Risk Minimization and then converted into graphs using the SLIC superpixel algorithm. To generate the dataset, run the following commands:
cd dataset_gen
python prepare_mnist.py --dataset 'cmnist' -t 8 -s 'train'
python prepare_mnist.py --dataset 'cmnist' -t 8 -s 'test'
The generated data will be put into ./data/CMNISTSP at the root directory of this repo.
Note that two auxiliary datasets, ./data/MNIST and ./data/ColoredMNIST, will also be created as the basis for generating ./data/CMNISTSP.
To use the dataset, set the --dataset option to CMNIST.
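For readers unfamiliar with the Invariant Risk Minimization coloring recipe, the sketch below illustrates the idea for a single digit. It is an assumption-laden illustration and has not been checked against prepare_mnist.py: the label mapping (0-4 to 0, 5-9 to 1), the 25% label noise, and the function name are taken from the usual description of Colored MNIST, and the superpixel conversion is not shown.

```python
import random


def colored_mnist_label_and_color(digit, color_flip_prob, rng, label_noise=0.25):
    """Return (binary_label, color_id) for one digit, IRM-style.

    Assumptions (not verified against prepare_mnist.py): digits 0-4 map to
    label 0 and 5-9 to label 1, the label is flipped with probability
    label_noise, and the color equals the noisy label unless it is flipped
    with probability color_flip_prob.
    """
    label = int(digit >= 5)
    if rng.random() < label_noise:
        label = 1 - label
    color = label
    if rng.random() < color_flip_prob:
        color = 1 - color
    return label, color


if __name__ == "__main__":
    rng = random.Random(0)
    # A small flip probability makes color a strong but spurious cue for the label.
    samples = [colored_mnist_label_and_color(d % 10, 0.1, rng) for d in range(1000)]
    agreement = sum(label == color for label, color in samples) / len(samples)
    print(f"label/color agreement: {agreement:.2f}")
```

Using a different color_flip_prob per environment is what makes color unreliable at test time while the digit shape stays informative.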
Both Graph-SST5 and Twitter are based on the datasets provided by DIG.
To get the datasets, download them via this link provided by DIG and the GNN explainability survey authors.
Then unzip the data into ./data/Graph-SST2/raw and ./data/Graph-Twitter/raw.
When --dataset is set to the dataset name in main.py, the data loading process adds the degree biases automatically.
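To make the notion of a degree bias concrete, here is one plausible way such a split could be constructed. It is purely illustrative and not taken from the repository: the function names, the dictionary layout of a graph, and the choice to place low-degree graphs in the training split are all assumptions, and the actual loader may differ.

```python
def average_degree(num_nodes, edges):
    """Average node degree of an undirected graph given as (u, v) pairs."""
    if num_nodes == 0:
        return 0.0
    return 2.0 * len(edges) / num_nodes


def split_by_degree(graphs, train_fraction=0.5):
    """Sort graphs by average degree and split them into two groups.

    Assumption: low-degree graphs go to the first (train) group and the rest
    to the second (test) group; the real ratio and direction may differ.
    """
    ranked = sorted(graphs, key=lambda g: average_degree(g["num_nodes"], g["edges"]))
    cut = int(len(ranked) * train_fraction)
    return ranked[:cut], ranked[cut:]


if __name__ == "__main__":
    toy = [
        {"name": "triangle", "num_nodes": 3, "edges": [(0, 1), (1, 2), (0, 2)]},
        {"name": "edge", "num_nodes": 2, "edges": [(0, 1)]},
    ]
    train, test = split_by_degree(toy)
    print([g["name"] for g in train], [g["name"] for g in test])
```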
We use the datasets provided by the size-invariant-GNNs authors, who have already sampled the datasets with graph size distribution shifts injected.
The datasets can be downloaded via this link.
After downloading, unzip the datasets into ./data/TU.
To use the datasets, set --dataset to the dataset name in main.py.
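The unzip step can also be scripted. The sketch below uses Python's standard zipfile module; the helper name unzip_into and the archive file name in the usage line are hypothetical, since the README does not name the downloaded file.

```python
import zipfile
from pathlib import Path


def unzip_into(archive_path, target_dir="./data/TU"):
    """Extract a downloaded zip archive into target_dir, creating it if needed.

    Returns the sorted names of the top-level entries found in target_dir.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())


if __name__ == "__main__":
    archive = Path("downloaded_datasets.zip")  # hypothetical file name
    if archive.is_file():
        print(unzip_into(archive))
```

The same helper works for the other archives by passing a different target_dir, for example ./data/Graph-Twitter/raw.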