Build data set #21

Open
Yong-Q opened this issue Jun 27, 2024 · 5 comments

Yong-Q commented Jun 27, 2024

Can you share your approach to building the dataset? For example, how do you write the JSON from a CIF structure?

qoffee commented Aug 5, 2024

Hello! We want to use your model, but currently don't understand how to build the dataset to train the predictor and generator. Can you share the code to create the training data from the CIF files?

hspark1212 (Owner) commented

@qoffee @Yong-Q Sorry for the late reply.

I've pushed the data preparation notebook file in this commit. I must admit that the process of constructing the dataset for the generator, predictor, and reinforcement learning is quite complex. To help with this, I've shared the Google Drive link containing the data we used.

Since the generator is already pretrained with this dataset, you won't need to train it again. However, if you want to run reinforcement learning with different properties, you can do so after training a new predictor to predict the desired properties.

qoffee commented Aug 9, 2024

@hspark1212 thanks, I looked at the notebook, but it is still not obvious to me how to get the correct input in the format [topo, mc, ol, num_conn, frags, selfies] used in your dataset. Could you share a small snippet showing how to extract the necessary information from an input .cif file?

hspark1212 (Owner) commented

Hi @qoffee

Unfortunately, converting a CIF file to the required input format presents several challenges:

1. The structure in the CIF file needs to be decomposed into a topology and building blocks that are compatible with the PORMAKE building blocks, as these are represented by categorical variables.
2. The organic linkers within the CIF file must also be extracted and converted into SELFIES.

Due to these complexities, we decided to construct a new dataset for this work rather than relying on pre-existing databases like the CoRE MOF database.

As outlined in the paper, the dataset was created through the following procedures:

- Topology: Sourced from the RCSR database.
- Metal Clusters: Extracted from the building blocks database within PORMAKE, including metals.
- Organic Linkers: Taken from the building blocks database in PORMAKE, excluding metals, and supplemented with merged building blocks for data augmentation.

As a result, training reinforcement learning models on structures outside this dataset is quite tricky.
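
For anyone who already has the individual pieces, here is a rough sketch of how one entry in the six-field layout asked about earlier in this thread ([topo, mc, ol, num_conn, frags, selfies]) might be assembled and written to JSON. Only the field order comes from the thread. The meaning and type of each field, the helper `make_entry`, the key `example_structure`, and all values are assumptions for illustration, not the repository's actual format.

```python
import json
from typing import Any, List

# Field order taken from the question in this thread. What each field holds
# (string name, integer index, list, ...) is an assumption, not a specification.
FIELDS = ["topo", "mc", "ol", "num_conn", "frags", "selfies"]


def make_entry(**values: Any) -> List[Any]:
    """Return the values as a list in FIELDS order; reject missing or extra keys."""
    if set(values) != set(FIELDS):
        raise ValueError(f"expected exactly these fields: {FIELDS}")
    return [values[name] for name in FIELDS]


# Hypothetical placeholder values, for illustration only.
entry = make_entry(
    topo="pcu",
    mc="N1",
    ol="E1",
    num_conn=2,
    frags=["[C][=C]"],
    selfies="[C][=C][C][=C]",
)

# One JSON object keyed by a (hypothetical) structure identifier.
dataset = {"example_structure": entry}
serialized = json.dumps(dataset, indent=2)
print(serialized)
```

The hard part described above, getting those values out of an arbitrary CIF file, is not covered by this sketch.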

Thanks,

Yong-Q commented Aug 16, 2024

The database itself may not be the most important part. What is needed is the process for going from a CIF file to its extracted features. The topology and node CIF (topo + Node.cif) can be obtained through PORMAKE. The other steps matter more: converting the linkers to SMILES format and encoding the topology and nodes as numbers.
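
As a minimal sketch of the numerical encoding step, assuming the topology and node names are already known: each distinct name is mapped to an integer index through a lookup table. The helpers `build_vocab` and `encode` and the example names are hypothetical, and the index assignment (sorted order) is an assumption that will not match the indices in the published dataset.

```python
from typing import Dict, Iterable, List


def build_vocab(names: Iterable[str]) -> Dict[str, int]:
    """Map each distinct name to an integer index, assigned in sorted order."""
    return {name: idx for idx, name in enumerate(sorted(set(names)))}


def encode(names: List[str], vocab: Dict[str, int]) -> List[int]:
    """Look up the index of each name; an unknown name raises KeyError."""
    return [vocab[name] for name in names]


# Hypothetical topology names, one per structure, for illustration only.
topologies = ["pcu", "dia", "pcu", "fcu"]

topo_vocab = build_vocab(topologies)
topo_ids = encode(topologies, topo_vocab)

print(topo_vocab)  # {'dia': 0, 'fcu': 1, 'pcu': 2}
print(topo_ids)    # [2, 0, 2, 1]
```

The same lookup would apply to node (metal cluster) names. To feed a pretrained model, the table would have to be the one used when that model was trained, not one rebuilt from new data.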
