MTransformer

Materials Transformers

Citation: Fu, Nihang, Lai Wei, Yuqi Song, Qinyang Li, Rui Xin, Sadman Sadeed Omee, Rongzhi Dong, Edirisuriya M. Dilanga Siriwardane, and Jianjun Hu. "Material transformers: deep learning language models for generative materials design." Machine Learning: Science and Technology 4, no. 1 (2023): 015001. PDF

by Machine Learning and Evolution Laboratory, University of South Carolina

Benchmark Datasets for training inorganic materials composition transformers

ICSD-mix dataset (52317 samples)

ICSD-pure dataset (39431 samples)

Hybrid-mix dataset (418983 samples)

Hybrid-pure dataset (257138 samples)

Hybrid-strict dataset (212778 samples)

All of the above datasets can be downloaded from Figshare.

Trained Materials Transformer Models

| Model | ICSD-mix | ICSD-pure | Hybrid-mix | Hybrid-pure | Hybrid-strict |
|---|---|---|---|---|---|
| MT-GPT | GPT-Im | GPT-Ip | GPT-Hm | GPT-Hp | GPT-Hs |
| MT-GPT2 | GPT2-Im | GPT2-Ip | GPT2-Hm | GPT2-Hp | GPT2-Hs |
| MT-GPTJ | GPTJ-Im | GPTJ-Ip | GPTJ-Hm | GPTJ-Hp | GPTJ-Hs |
| MT-GPTNeo | GPTNeo-Im | GPTNeo-Ip | GPTNeo-Hm | GPTNeo-Hp | GPTNeo-Hs |
| MT-BART | BART-Im | BART-Ip | BART-Hm | BART-Hp | BART-Hs |
| MT-RoBERTa | RoBERTa-Im | RoBERTa-Ip | RoBERTa-Hm | RoBERTa-Hp | RoBERTa-Hs |
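
The model names in the table follow a regular pattern: the model family without the `MT-` prefix, a hyphen, then a two-letter dataset code. The sketch below only illustrates that naming pattern; the helper `model_tag` is hypothetical and is not part of this repository.

```python
# Two-letter dataset codes, read off the table above.
DATASET_CODE = {
    "ICSD-mix": "Im",
    "ICSD-pure": "Ip",
    "Hybrid-mix": "Hm",
    "Hybrid-pure": "Hp",
    "Hybrid-strict": "Hs",
}


def model_tag(model: str, dataset: str) -> str:
    """Build a table entry such as 'GPT2-Hm' (hypothetical helper)."""
    prefix = "MT-"
    # Drop the leading "MT-" if present; keep the rest of the family name.
    family = model[len(prefix):] if model.startswith(prefix) else model
    return f"{family}-{DATASET_CODE[dataset]}"


if __name__ == "__main__":
    print(model_tag("MT-GPT2", "Hybrid-mix"))
    print(model_tag("MT-RoBERTa", "ICSD-pure"))
```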

How to train with your own dataset

Installation

  1. Create a conda (or other) environment.
  2. Install the basic packages:
pip install -r requirements.txt
  3. Install PyTorch from the PyTorch website, choosing the build that matches your Python and CUDA versions.
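
After installation, a quick check can confirm that the key packages are visible to the interpreter. This is a minimal sketch, not a script shipped with the repository. It assumes `transformers` is among the packages listed in `requirements.txt`, which this README does not state explicitly; adjust the package names as needed.

```python
import importlib.util
import sys


def environment_report(packages=("torch", "transformers")):
    """Map each top-level package name to True if it can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    report = environment_report()
    for name, found in report.items():
        print(f"{name}: {'found' if found else 'missing'}")
    # Only import torch if it is installed, then report whether CUDA is usable.
    if report["torch"]:
        import torch
        print("CUDA available:", torch.cuda.is_available())
```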

Data preparation

Download the datasets from the Figshare link above, then unzip them under the MT_dataset folder. The directory should then look like this:

MTransformer
   ├── MT_dataset
       ├── hy_mix
           ├── test.txt
           ├── train.txt
           ├── valid.txt
       ├── hy_pure
       ├── hy_strict
       ├── icsd_mix
       ├── icsd_pure
       ├── mp
   ├── MT_models
       ├── MT_Bart
           ├── hy_mix
               ├── config.json
               ├── pytorch_model.bin
               ├── training_args.bin
           ├── hy_pure
           ├── hy_strict
           ├── icsd_mix
           ├── icsd_pure
       ├── MT_GPT
       ├── MT_GPT2
       ├── MT_GPTJ
       ├── MT_GPTNeo
       ├── MT_RoBERTa
       ├── tokenizer
           ├── vocab.txt       
   ├── generateFormula_random.py
   ├── multi_generateFormula_random.py
   ├── README.md
   └── requirements.txt
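
To confirm that the datasets were unzipped into the expected places, a small check along the following lines may help. It is a sketch under two assumptions: that every dataset folder holds the same three files shown for `hy_mix`, and that the script is run from the repository root. The `mp` folder is skipped because its contents are not listed above.

```python
from pathlib import Path

# Folder and file names taken from the directory tree above.
DATASETS = ("hy_mix", "hy_pure", "hy_strict", "icsd_mix", "icsd_pure")
SPLITS = ("train.txt", "valid.txt", "test.txt")


def missing_dataset_files(root="."):
    """Return the expected dataset files that do not exist under root/MT_dataset."""
    base = Path(root) / "MT_dataset"
    return [
        base / dataset / split
        for dataset in DATASETS
        for split in SPLITS
        if not (base / dataset / split).is_file()
    ]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        for path in missing:
            print("missing:", path)
    else:
        print("All expected dataset files are present.")
```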

Training

For example, to train an MT-GPT model on the Hybrid-mix dataset (adjust the paths if your local folder names differ from those shown):

python ./MT_model/MT_GPT/train_GPT.py  --tokenizer ./MT_model/tokenizer/   --train_data  ./MT_Dataset/hy_mix/train.txt  --valid_data ./MT_Dataset/hy_mix/valid.txt  --output_dir ./output

Training the other models follows the same pattern as MT-GPT.
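
When training on several datasets, it can be convenient to build the command line programmatically. The sketch below reuses only the flags and default paths from the example command above. The helper `build_train_command` is hypothetical, and because the training scripts for the other models are not named in this README, the `script` argument has to be supplied by the reader for anything other than MT-GPT.

```python
import shlex


def build_train_command(
    dataset,
    script="./MT_model/MT_GPT/train_GPT.py",
    tokenizer="./MT_model/tokenizer/",
    data_root="./MT_Dataset",
    output_dir="./output",
):
    """Return the training command as an argument list (hypothetical helper)."""
    return [
        "python", script,
        "--tokenizer", tokenizer,
        "--train_data", f"{data_root}/{dataset}/train.txt",
        "--valid_data", f"{data_root}/{dataset}/valid.txt",
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Print one command per dataset; pass the list to subprocess.run to execute it.
    for dataset in ("hy_mix", "icsd_pure"):
        command = build_train_command(dataset, output_dir=f"./output/{dataset}")
        print(shlex.join(command))
```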

How to generate new materials compositions/formulas using the trained models

Download the models from the above link, or use your own trained models, then put them into the corresponding folders.

Generate materials formulas using the trained MT-GPT model.

python generateFormula_random.py --tokenizer ./MT_model/tokenizer --model_name OpenAIGPTLMHeadModel --model_path ./MT_model/MT_GPT/hy_mix

We also provide multi-threaded generation. The default number of threads is 10; you can change it with the --n_thread argument.

python multi_generateFormula_random.py  --tokenizer ./tokenizer  --model_name GPT2LMHeadModel  --model_path ./MT_GPT2/hy_mix  --n_thread 5
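
For readers curious how a thread count like `--n_thread` can be used to divide the work, here is a generic sketch of splitting a generation job across worker threads. It is not the implementation in `multi_generateFormula_random.py`; the names `split_evenly`, `generate_in_threads`, and the `generate_batch` callable are hypothetical, and a stand-in function replaces the real model call.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(total, n_thread):
    """Split `total` items into `n_thread` batch sizes that differ by at most one."""
    base, extra = divmod(total, n_thread)
    return [base + (1 if i < extra else 0) for i in range(n_thread)]


def generate_in_threads(generate_batch, total, n_thread=10):
    """Run generate_batch(count) in each thread and merge the results.

    generate_batch is a hypothetical callable that returns a list of formulas.
    """
    with ThreadPoolExecutor(max_workers=n_thread) as pool:
        batches = pool.map(generate_batch, split_evenly(total, n_thread))
        # Flatten the per-thread lists while the pool is still open.
        return [formula for batch in batches for formula in batch]


if __name__ == "__main__":
    def fake_generate(count):
        # Stand-in for a real model call; returns placeholder strings.
        return ["NaCl"] * count

    results = generate_in_threads(fake_generate, total=23, n_thread=5)
    print(len(results))
```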