Dataset and curation

Chris Iacovella edited this page Jan 16, 2025 · 10 revisions

Dataset module

The dataset module provides functions and classes to load datasets from curated HDF5 files and expose them as torch.utils.data.Dataset or LightningDataModule instances for training NNPs. It implements actions associated with data storage, caching, and retrieval, as well as the pipeline from the stored HDF5 files to the PyTorch dataset class that can be used for training. The general workflow for interacting with public datasets is as follows:

  1. obtaining the dataset
  2. processing the dataset and storing it in an HDF5 file with standard naming and units
  3. uploading to Zenodo and updating the retrieval link in the dataset implementation

The specific dataset classes, such as QM9Dataset or SPICEDataset, download an HDF5 file with defined key names and values in a particular format from Zenodo and load the data into memory. The values in the dataset need to be specified in the [OpenMM unit system](http://docs.openmm.org/6.2.0/userguide/theory.html#units).
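As a rough illustration of what "specified in the OpenMM unit system" means in practice, the sketch below converts positions from angstrom to nanometer and an energy from hartree to kJ/mol using only the standard library. The helper names are hypothetical and are not part of modelforge, which handles units through openff-units; only the arithmetic is shown here.

```python
# Conversion factors into the OpenMM unit system (nanometer, kJ/mol).
# Helper names are hypothetical; modelforge itself relies on openff-units.
ANGSTROM_TO_NANOMETER = 0.1
HARTREE_TO_KJ_PER_MOL = 2625.4996394799  # CODATA 2018 value


def angstrom_to_nanometer(positions):
    """Convert a nested list of [x, y, z] positions from angstrom to nanometer."""
    return [[c * ANGSTROM_TO_NANOMETER for c in xyz] for xyz in positions]


def hartree_to_kj_per_mol(energy):
    """Convert a single energy value from hartree to kJ/mol."""
    return energy * HARTREE_TO_KJ_PER_MOL


# Placeholder inputs chosen only to exercise the helpers.
positions_nm = angstrom_to_nanometer([[0.0, 1.0, 2.0]])
energy_kj = hartree_to_kj_per_mol(-1.0)
```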

The public API for creating a TorchDataset is implemented in the specific data classes (e.g., QM9Dataset) and the DatasetFactory. The TorchDataset can be loaded in a PyTorch DataLoader.

modelforge
  datasets/ # defines the interaction with public datasets 
    dataset.py
      TorchDataset(torch.utils.data.Dataset) # A custom dataset class to wrap numpy datasets for PyTorch.
	* __init__(self, dataset: np.ndarray, property_name: PropertyNames)
	* __len__()
	* __getitem__(self, idx:int) -> Dict[str, torch.Tensor]
      HDF5Dataset() # Base class for data stored in HDF5 format.
        * _{to|from}_file_cache() # write/read high-performance numpy cache file (can change a lot)
        * _{from}_hdf5() # read our HDF5 format (reproducible and archival) (also supports gzipped files)
	* _perform_transformations(label_transform: Optional[Dict[str, Callable]], transforms: Dict[str, Callable]) # transform any entry of the dataset using a custom function
      DatasetFactory # Factory class for creating Dataset instances.
	* create_dataset(data: HDF5Dataset, label_transform: Optional[Dict[str, Callable]],transform: Optional[Dict[str, Callable]]) -> TorchDataset
	# Creates a TorchDataset instance given an HDF5Dataset.
      TorchDataModule(pl.LightningDataModule) # A custom data module class to handle data loading and preparation for PyTorch Lightning training.
	* __init__(self, data: HDF5Dataset, SplittingStrategy: SplittingStrategy, batch_size)
	* prepare_data()
	* setup()
	* {train|val|test}_dataloader()-> DataLoader					
    {qm9|spice|phalkethoh|ani1x|ani2x|tmqm}.py
      QM9Dataset(HDF5Dataset) # Data class for handling QM9 data.
      * properties_of_interest() -> List[str] # [getter|setter], entries in dataset that are retrieved  
      * available_properties() -> List[str] # list of available properties in the dataset
      * _download() # Download the hdf5 file containing the data from source.
    transformation.py # transformation functions applied to entries in dataset
    * default_transformations 
    utils.py
    RandomSplittingStrategy(SplittingStrategy)
    * split(dataset:TorchDataset) -> Tuple[Subset, Subset, Subset] # Splits the provided dataset into training, validation, and testing subsets
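
The outline above relies on two ideas: a map-style dataset (an object with __len__ and __getitem__) and a random split into training, validation, and testing subsets. The sketch below shows both without PyTorch. ListDataset and random_split_indices are hypothetical stand-ins, not modelforge's TorchDataset or RandomSplittingStrategy, and the 0.8/0.1/0.1 fractions and the seed are assumptions chosen for illustration.

```python
import random
from typing import Any, Dict, List, Sequence, Tuple


class ListDataset:
    """Hypothetical map-style dataset (not modelforge's TorchDataset)."""

    def __init__(self, records: Sequence[Dict[str, Any]]):
        self._records = list(records)

    def __len__(self) -> int:
        return len(self._records)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self._records[idx]


def random_split_indices(
    n: int,
    fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),  # assumed defaults
    seed: int = 42,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(fractions[0] * n)
    n_val = int(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Placeholder records; the key "E" is illustrative only.
dataset = ListDataset([{"E": float(i)} for i in range(10)])
train_idx, val_idx, test_idx = random_split_indices(len(dataset))
train_subset = [dataset[i] for i in train_idx]
```

Splitting indices rather than the data itself keeps the three subsets as cheap views onto one dataset, which is the same idea behind the Subset objects returned by split() in the outline.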

Curation module

The curation module provides functionality to retrieve source datasets and generate HDF5 datafiles with a consistent format and units, to be loaded by the dataset module.

Note: This is currently being refactored to provide a cleaner interface with improved validation.

The purpose of including this module in the package is to encapsulate all routines used to generate the input datafile, including any and all manipulation of the underlying data (e.g., unit conversion, summing of quantities, calculation of reference energy, etc.), to ensure transparency and reproducibility.

The HDF5 files generated by the curation module have units (with openff-units compatible names) defined in the "u" attribute for each quantity. For efficient data writing and reading, conformers are grouped together into a single entry.
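
To make the grouping of conformers concrete, here is a minimal sketch that merges a flat list of per-conformer dictionaries into one record per molecule. The function name group_conformers, the input layout, and all numerical values are assumptions made for illustration; this is not the curation module's implementation.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical flat input: one dictionary per conformer (placeholder values).
conformers = [
    {"name": "mol_a", "atomic_numbers": [8, 1, 1], "energy": -1.0},
    {"name": "mol_b", "atomic_numbers": [1, 1], "energy": -0.5},
    {"name": "mol_a", "atomic_numbers": [8, 1, 1], "energy": -1.1},
]


def group_conformers(flat: List[dict]) -> List[dict]:
    """Merge conformers that share a name into a single record."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for conf in flat:
        grouped[conf["name"]].append(conf)

    records = []
    for name, confs in grouped.items():
        records.append(
            {
                "name": name,
                "n_configs": len(confs),
                # Identical for every conformer, so stored once as (n_atoms, 1).
                "atomic_numbers": [[z] for z in confs[0]["atomic_numbers"]],
                # One row per conformer: shape (n_configs, 1).
                "energy": [[c["energy"]] for c in confs],
            }
        )
    return records


records = group_conformers(conformers)
```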

Furthermore, each quantity in a record carries a descriptor that tells the dataset module how to parse the underlying array. The descriptor states what each axis of a loaded quantity represents, so this information does not have to be inferred or hard-coded. This makes the dataloader more general, allowing it to work with different datasets in which the names of the underlying quantities vary. The descriptor consists of two strings joined by an underscore. The first string tells us how to handle axis=0 and is either series or single: series indicates that we loop over axis=0 to retrieve information for each conformer, whereas single indicates that the quantity applies to all conformers. The second string describes axis=1 (if available) and is one of rec, mol, or atom: rec indicates that the information describes the entire record and is not stored as an array (e.g., SMILES string or molecular formula); mol indicates that the quantity is calculated on a per-molecule basis (e.g., energy); and atom indicates that the quantity is a per-atom property (e.g., partial charge).

The possible combinations, with examples:

  • "single_rec" states that the quantity encodes a single value that applies to all conformers in the record, e.g., molecular formula or SMILES string.
  • "single_mol" states that the quantity applies to all conformers and that the underlying value is per-molecule, e.g., reference energy.
  • "single_atom" states that the quantity applies to all conformers, with atom-wise values encoded, e.g., the atomic numbers (for methane the underlying array would be [[6],[1],[1],[1],[1]] with shape [n_atoms, 1]).
  • "series_mol" states that the quantity depends on the conformer (i.e., axis=0 indexes the different conformers) and that the values are per-molecule, e.g., energy. This will be of shape [n_configs, x], where n_configs is the number of conformers and x denotes a variable dimension (e.g., energy would have x=1, but rotational constants would have x=3).
  • "series_atom" states that the quantity depends on the conformer and that the values are per-atom, e.g., partial charges. This will be of shape [n_configs, n_atoms, x], where again x denotes a variable size.
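
The combinations listed above can be sketched as a small parser that splits a descriptor into its two parts and reports the array shape it implies. Both function names are hypothetical and not part of modelforge. The text does not state a shape for "single_mol", so the (x,) returned for it here is an assumption.

```python
from typing import Optional, Tuple

AXIS0_OPTIONS = {"single", "series"}
AXIS1_OPTIONS = {"rec", "mol", "atom"}


def parse_descriptor(descriptor: str) -> Tuple[str, str]:
    """Split e.g. 'series_atom' into ('series', 'atom') and validate both parts."""
    axis0, sep, axis1 = descriptor.partition("_")
    if not sep or axis0 not in AXIS0_OPTIONS or axis1 not in AXIS1_OPTIONS:
        raise ValueError(f"unknown descriptor: {descriptor!r}")
    if (axis0, axis1) == ("series", "rec"):
        raise ValueError("'series_rec' is not one of the listed combinations")
    return axis0, axis1


def expected_shape(
    descriptor: str, n_configs: int, n_atoms: int, x: int = 1
) -> Optional[Tuple[int, ...]]:
    """Return the shape implied by a descriptor, or None for non-array entries."""
    axis0, axis1 = parse_descriptor(descriptor)
    if axis1 == "rec":
        return None  # e.g., a SMILES string; not stored as an array
    if axis1 == "atom":
        trailing: Tuple[int, ...] = (n_atoms, x)
    else:
        # "mol": the text gives (n_configs, x) for series_mol; using (x,)
        # for single_mol is an assumption made for this sketch.
        trailing = (x,)
    return (n_configs,) + trailing if axis0 == "series" else trailing


methane_charges_shape = expected_shape("series_atom", n_configs=1, n_atoms=5)
```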

Note: this is currently being refactored such that there will be two main classifiers "per_atom" and "per_system".

Data format:

As an example, let us load the first record for the QM9 dataset

from modelforge.curation.qm9_curation import QM9Curation

qm9_dataset = QM9Curation(
    hdf5_file_name="qm9_dataset.hdf5",
    output_file_dir="datasets/hdf5_files",
    local_cache_dir="datasets/qm9_dataset_raw",
)

qm9_dataset.process(max_records=1)

In all the curated datasets, a list named data is generated. Each entry in the list corresponds to a specific molecule, where the molecule information is stored as a dictionary.

For example, we can access all of the properties stored in the dataset as follows:

for data_point in qm9_dataset.data:
    for key, val in data_point.items():
        print(f"{key} : {val} : {qm9_dataset._record_entries_series[key]}")

Note this also accesses the _record_entries_series dictionary in the dataset, which stores the descriptor discussed above.

Let us examine a small selection of the stored data to discuss the specific format and common elements between all datasets.

In all datasets, each entry in the data list will contain several keys:

  • name -- unique identifying string of the molecule, typically taken from the original dataset
  • n_configs -- number of configurations/conformers for the molecule
  • atomic_numbers -- array of atomic numbers (in order) of the molecule.
  • geometry -- array of atomic positions of the conformers

name and n_configs are both considered to be of format single_rec (see above) as these values apply to all data in the molecule and are not conformer dependent.

name : dsgdb9nsd_000001 : <class 'str'> : single_rec
n_configs : 1 : <class 'int'> : single_rec

atomic_numbers is marked as single_atom: the array applies to all conformers (the order of the atomic indices cannot change) but is also a per-atom property, which is why it is single_atom rather than single_rec. As can be seen below, the shape of atomic_numbers is (n_atoms, 1), where in this case n_atoms=5. We define this as (n_atoms, 1) instead of (n_atoms) for consistency with other per-atom properties:

atomic_numbers : 
[
 [6]
 [1]
 [1]
 [1]
 [1]
] 
<class 'numpy.ndarray'>
single_atom

The geometry is of format series_atom, as there is a unique set of coordinates for each conformer. It has shape (n_configs, n_atoms, 3), which is (1, 5, 3) here since n_configs=1. Note that this is a numpy.ndarray with units attached (using openff-units, based on pint).

geometry : 
[[
  [-0.0012698135899999999 0.10858041577999998 0.00080009958]
  [0.00021504159999999998 -0.0006031317599999999 0.00019761204]
  [0.10117308433 0.14637511618 2.7657479999999996e-05]
  [-0.05408150689999999 0.14475266137999998 -0.08766437151999999]
  [-0.05238136344999999 0.14379326442999998 0.09063972942]
]] nanometer : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_atom

Note that for data of format series_atom, the final dimension is variable. For example, the charges in this dataset are series_atom, but only a single charge is associated with each atom, rather than a vector of length 3. Hence, the entry has shape (n_configs, n_atoms, 1).

charges : 
[[
 [-0.535689]  
 [0.133921]  
 [0.133922]  
 [0.133923]  
 [0.133923]
]] elementary_charge : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_atom

Datasets will also contain information about the energy, although the name of this will depend on the dataset itself. For example, in QM9, we have internal_energy_at_0K, which is of format series_mol, meaning there will be a single unique value for each conformer, hence of shape (n_configs, 1) in this case.

internal_energy_at_0K : 
[
 [-106277.4161215308]
] kilojoule_per_mole : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_mol

As with series_atom, the last dimension of series_mol entries is variable (and is inferred during data load), so an entry can hold not just a single float per molecule but also a vector. For example, harmonic_vibrational_frequencies has shape (n_configs, 9) in this case:

harmonic_vibrational_frequencies : 
[
 [1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034  3151.6788 3151.7078]
] / centimeter : 
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> : 
series_mol

This data array, along with the "format" information, is written to an HDF5 file in roughly the same general structure. HDF5 files can be accessed in a very similar fashion to dictionaries using h5py. The key differences in the data structure are as follows: the name field is used to create a top-level key in the HDF5 file, with properties stored one level below it. Units are no longer attached to values/arrays but are instead stored in the attributes (attrs) associated with each property; the format (e.g., series_mol) is also stored as an attribute. A sketch of the hierarchy is as follows:

1- name
2-- property
3--- attrs: units as "u", format
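
The rearrangement from an in-memory record to this hierarchy can be mimicked with plain dictionaries. In the sketch below, nested dicts stand in for h5py groups, datasets, and attrs; the function to_hdf5_layout and the "value"/"attrs" keys are inventions of this sketch, not the curation module's writer.

```python
from typing import Any, Dict


def to_hdf5_layout(
    record: Dict[str, Any], formats: Dict[str, str], units: Dict[str, str]
) -> Dict[str, Any]:
    """Rearrange one in-memory record into a name -> property -> attrs hierarchy."""
    properties = {}
    for key, value in record.items():
        if key == "name":
            continue  # the name becomes the top-level key instead
        attrs = {"format": formats[key]}
        if key in units:
            attrs["u"] = units[key]  # unit stored as a plain string attribute
        properties[key] = {"value": value, "attrs": attrs}
    return {record["name"]: properties}


# Values taken from the methane record discussed in this section, units stripped.
record = {
    "name": "dsgdb9nsd_000001",
    "atomic_numbers": [[6], [1], [1], [1], [1]],
    "internal_energy_at_0K": [[-106277.4161215308]],
}
formats = {"atomic_numbers": "single_atom", "internal_energy_at_0K": "series_mol"}
units = {"internal_energy_at_0K": "kilojoule_per_mole"}

layout = to_hdf5_layout(record, formats, units)
```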

The following script demonstrates how to access the data (although in general, users will not need to directly access files, as these will be automatically loaded in the dataset classes).

import h5py

filename = "datasets/hdf5_files/qm9_dataset.hdf5"

# Open read-only; each top-level key is a molecule name.
with h5py.File(filename, "r") as h5:
    for molecule_name in h5.keys():
        print("molecule_name:", molecule_name)

        for property in h5[molecule_name].keys():
            entry = h5[molecule_name][property]
            print("-Property:", property)
            print(entry.attrs["format"])
            # "rec" entries (e.g., strings) are not stored as arrays, so skip the shape.
            if "rec" not in entry.attrs["format"]:
                print(entry.shape)
            print(entry[()])  # read the full value into memory
            # The "u" attribute is only present for quantities that carry units.
            if "u" in entry.attrs:
                print(entry.attrs["u"])

The first few outputs are as follows:

molecule_name: dsgdb9nsd_000001
-Property: atomic_numbers
single_atom
(5, 1)
[[6]
 [1]
 [1]
 [1]
 [1]]
-Property: charges
series_atom
(1, 5, 1)
[[[-0.535689]
  [ 0.133921]
  [ 0.133922]
  [ 0.133923]
  [ 0.133923]]]
elementary_charge

Note that in this format, units are written as strings; openff-units allows these to be easily reattached to the quantity of interest, simply by passing the string to Quantity.

from openff.units import Quantity

# Run inside the "with h5py.File(...)" block above, while the file is still open.
value_without_units = h5[molecule_name][property][()]
units_string = h5[molecule_name][property].attrs["u"]

value_with_units = value_without_units * Quantity(units_string)
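
Once a record has been read back, the format string decides how each entry is indexed: series entries are indexed along axis=0 to pick a conformer, while single entries are shared by all conformers. The sketch below shows this with nested lists. iter_conformers is a hypothetical helper, not the dataset module's loader, and the values are those of the methane record shown earlier with units omitted.

```python
from typing import Any, Dict, Iterator


def iter_conformers(
    record: Dict[str, Any], formats: Dict[str, str]
) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per conformer from a grouped record."""
    for i in range(record["n_configs"]):
        conformer = {}
        for key, value in record.items():
            if formats[key].startswith("series"):
                conformer[key] = value[i]  # index axis=0 to pick conformer i
            else:
                conformer[key] = value  # "single" entries are shared as-is
        yield conformer


record = {
    "name": "dsgdb9nsd_000001",
    "n_configs": 1,
    "atomic_numbers": [[6], [1], [1], [1], [1]],
    "charges": [[[-0.535689], [0.133921], [0.133922], [0.133923], [0.133923]]],
    "internal_energy_at_0K": [[-106277.4161215308]],
}
formats = {
    "name": "single_rec",
    "n_configs": "single_rec",
    "atomic_numbers": "single_atom",
    "charges": "series_atom",
    "internal_energy_at_0K": "series_mol",
}

conformers = list(iter_conformers(record, formats))
```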

PhalKetHOH

image

ANI2x

image

QM9

image

SPICE2

image
