Large Dataset Pre-processing
HDF5 file preprocessing
If you have a large dataset that might not fit into memory, it is recommended to preprocess the data on a CPU and use on-line dataloading for training the model.
To preprocess a dataset stored as an xyz file, run the preprocess_data.py script. An example is given here:
mkdir processed_data
python <mace_repo_dir>/mace/cli/preprocess_data.py \
--train_file="/path/to/train_large.xyz" \
--valid_fraction=0.05 \
--test_file="/path/to/test_large.xyz" \
--atomic_numbers="[1, 6, 7, 8, 9, 15, 16, 17, 35, 53]" \
--r_max=4.5 \
--h5_prefix="processed_data/" \
--compute_statistics \
--E0s="average" \
--seed=123
This will create a directory processed_data containing the preprocessed data as HDF5 (.h5) files.
There will be one folder for training, one for validation, and a separate one for each config_type in the test set.
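By default the script runs in a single process. On a machine with many cores, the preprocessing can also be parallelised so that each worker writes its own HDF5 shard. A minimal sketch, assuming your MACE version exposes a --num_process option on preprocess_data.py (the flag name is an assumption here; confirm it in the --help output):

# identical to the command above, with one extra flag (assumed name: --num_process)
python <mace_repo_dir>/mace/cli/preprocess_data.py \
    --train_file="/path/to/train_large.xyz" \
    --valid_fraction=0.05 \
    --test_file="/path/to/test_large.xyz" \
    --atomic_numbers="[1, 6, 7, 8, 9, 15, 16, 17, 35, 53]" \
    --r_max=4.5 \
    --h5_prefix="processed_data/" \
    --compute_statistics \
    --E0s="average" \
    --num_process=4 \
    --seed=123

With four processes, each split is written as four shards (train_0.h5 through train_3.h5, and so on), which is the layout shown in the listing further below.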
To see all options with a short description of each, run python <mace_repo_dir>/mace/cli/preprocess_data.py --help.
The statistics of the dataset will be saved to a JSON file (statistics.json) in the same directory.
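Since the file is plain JSON, it can be inspected directly. A minimal sketch in Python (the exact set of keys stored is version-dependent, so none are assumed here):

import json

# statistics.json is produced when --compute_statistics is passed above
with open("processed_data/statistics.json") as f:
    stats = json.load(f)

# print whatever statistics this MACE version recorded
for key, value in stats.items():
    print(f"{key}: {value}")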
The preprocessed data can then be used to train the model with the on-line dataloader, as shown in the example below. If the preprocessing was run with multiple parallel workers, you will end up with multiple files in the processed_data directory. For example, with 4 workers you should see the following files:
ls processed_data/
statistics.json test train val
ls processed_data/test/
ls processed_data/train/
train_0.h5 train_1.h5 train_2.h5 train_3.h5
ls processed_data/val/
val_0.h5 val_1.h5 val_2.h5 val_3.h5
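To sanity-check that the shards were written and are readable, you can open one with h5py. A minimal sketch, assuming the h5py package is installed (the internal group layout of the files is a MACE implementation detail, so only the top-level keys are listed):

import h5py

# open one training shard read-only and list its top-level groups
with h5py.File("processed_data/train/train_0.h5", "r") as f:
    print(list(f.keys()))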
In this case, you should use the following command to train the model:
python <mace_repo_dir>/mace/cli/run_train.py \
--name="MACE_on_big_data" \
--num_workers=16 \
--train_file="./processed_data/train" \
--valid_file="./processed_data/val" \
--test_dir="./processed_data/test" \
--statistics_file="./processed_data/statistics.json" \
--model="ScaleShiftMACE" \
--num_interactions=2 \
--num_channels=128 \
--max_L=1 \
--correlation=3 \
--batch_size=32 \
--valid_batch_size=32 \
--max_num_epochs=100 \
--swa \
--start_swa=60 \
--ema \
--ema_decay=0.99 \
--amsgrad \
--error_table='PerAtomMAE' \
--device=cuda \
--seed=123
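For datasets at this scale, a single GPU is often the bottleneck. Recent MACE versions support multi-GPU training via PyTorch DistributedDataParallel; the sketch below assumes your version accepts the --distributed flag (verify with run_train.py --help) and that one process per GPU is launched with torchrun:

# assumed: --distributed enables DDP; torchrun starts one process per GPU
torchrun --nproc_per_node=4 <mace_repo_dir>/mace/cli/run_train.py \
    --name="MACE_on_big_data" \
    --distributed \
    --num_workers=16 \
    --train_file="./processed_data/train" \
    --valid_file="./processed_data/val" \
    --statistics_file="./processed_data/statistics.json" \
    --device=cuda \
    --seed=123

The remaining model and optimiser flags are identical to the single-GPU command above.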
LMDB file preprocessing
Note: This feature is available from version 0.3.11.
MACE supports the LMDB data format from fairchem for preprocessing large datasets such as OMAT. Here is an example YAML file for the LMDB data format:
heads:
  omat_pbe_refit:
    train_file: "/omat/train/aimd-from-PBE-1000-npt:/omat/train/aimd-from-PBE-3000-nvt:/omat/train/rattled-1000-subsampled:/omat/train/rattled-300:/omat/train/aimd-from-PBE-3000-npt:/omat/train/rattled-300-subsampled:/omat/train/rattled-500:/omat/train/aimd-from-PBE-1000-nvt:/omat/train/rattled-1000:/omat/train/rattled-500-subsampled:/omat/train/rattled-relax"
    valid_file: "/omat/val/aimd-from-PBE-1000-npt:/omat/val/aimd-from-PBE-3000-nvt:/omat/val/rattled-1000-subsampled:/omat/val/rattled-300:/omat/val/aimd-from-PBE-3000-npt:/omat/val/rattled-300-subsampled:/omat/val/rattled-500:/omat/val/aimd-from-PBE-1000-nvt:/omat/val/rattled-1000:/omat/val/rattled-500-subsampled:/omat/val/rattled-relax"
Multiple folders are concatenated with : in a single string; MACE reads the LMDB files inside each listed folder and loads them.
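To train with this configuration, pass the YAML file to the training script. A minimal sketch, assuming your MACE version accepts a YAML configuration file via --config (the file name heads_config.yaml is a placeholder):

python <mace_repo_dir>/mace/cli/run_train.py \
    --config="heads_config.yaml" \
    --name="MACE_on_omat" \
    --device=cuda

Any of the model and training flags from the HDF5 example above can be added on the command line, or placed in the same YAML file.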