Large Dataset Pre-processing

HDF5 file preprocessing

If you have a large dataset that might not fit into GPU memory, it is recommended to preprocess the data on a CPU and use on-line data loading when training the model. To preprocess a dataset specified as an xyz file, run the preprocess_data.py script. An example is given here:

mkdir processed_data
python <mace_repo_dir>/mace/cli/preprocess_data.py \
    --train_file="/path/to/train_large.xyz" \
    --valid_fraction=0.05 \
    --test_file="/path/to/test_large.xyz" \
    --atomic_numbers="[1, 6, 7, 8, 9, 15, 16, 17, 35, 53]" \
    --r_max=4.5 \
    --h5_prefix="processed_data/" \
    --compute_statistics \
    --E0s="average" \
    --seed=123

This will create a directory processed_data containing the preprocessed data as HDF5 (.h5) files. There will be one folder for training, one for validation, and a separate one for each config_type in the test set. To see all options, each with a short description, run python <mace_repo_dir>/mace/cli/preprocess_data.py --help. The statistics of the dataset will be saved in a JSON file in the same directory.
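
For orientation, the statistics file stores the quantities used to normalise the model. The exact contents depend on your MACE version; the field names below are indicative and the values are placeholders, so inspect your own file rather than relying on this sketch:

cat processed_data/statistics.json
{"atomic_energies": {...}, "avg_num_neighbors": ..., "mean": ..., "std": ...,
 "atomic_numbers": [1, 6, 7, 8, 9, 15, 16, 17, 35, 53], "r_max": 4.5}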

The preprocessed data can be used to train the model with the on-line dataloader, as shown in the example below. If the preprocessing is run with multiple processes, you will end up with multiple files in the processed_data directory. For example, with 4 processes you should see the following files:

ls processed_data/
statistics.json  test  train  val

ls processed_data/test/

ls processed_data/train/
train_0.h5  train_1.h5  train_2.h5  train_3.h5

ls processed_data/val/
val_0.h5  val_1.h5  val_2.h5  val_3.h5
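
To produce such sharded output, the preprocessing script can be run with several parallel workers. A minimal sketch, assuming your version of preprocess_data.py exposes a --num_process flag (check --help to confirm):

python <mace_repo_dir>/mace/cli/preprocess_data.py \
    --train_file="/path/to/train_large.xyz" \
    --valid_fraction=0.05 \
    --h5_prefix="processed_data/" \
    --num_process=4 \
    --seed=123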

In this case, you should use the following command to train the model:

python <mace_repo_dir>/mace/cli/run_train.py \
    --name="MACE_on_big_data" \
    --num_workers=16 \
    --train_file="./processed_data/train" \
    --valid_file="./processed_data/val" \
    --test_dir="./processed_data/test" \
    --statistics_file="./processed_data/statistics.json" \
    --model="ScaleShiftMACE" \
    --num_interactions=2 \
    --num_channels=128 \
    --max_L=1 \
    --correlation=3 \
    --batch_size=32 \
    --valid_batch_size=32 \
    --max_num_epochs=100 \
    --swa \
    --start_swa=60 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --error_table='PerAtomMAE' \
    --device=cuda \
    --seed=123

LMDB file preprocessing

Note: This feature is available from version 0.3.11.

MACE supports the LMDB data format from fairchem for preprocessing large datasets, such as OMAT. Here is an example YAML file for the LMDB data format:

heads:
  omat_pbe_refit:
    train_file: "/omat/train/aimd-from-PBE-1000-npt:/omat/train/aimd-from-PBE-3000-nvt:/omat/train/rattled-1000-subsampled:/omat/train/rattled-300:/omat/train/aimd-from-PBE-3000-npt:/omat/train/rattled-300-subsampled:/omat/train/rattled-500:/omat/train/aimd-from-PBE-1000-nvt:/omat/train/rattled-1000:/omat/train/rattled-500-subsampled:/omat/train/rattled-relax"
    valid_file: "/omat/val/aimd-from-PBE-1000-npt:/omat/val/aimd-from-PBE-3000-nvt:/omat/val/rattled-1000-subsampled:/omat/val/rattled-300:/omat/val/aimd-from-PBE-3000-npt:/omat/val/rattled-300-subsampled:/omat/val/rattled-500:/omat/val/aimd-from-PBE-1000-nvt:/omat/val/rattled-1000:/omat/val/rattled-500-subsampled:/omat/val/rattled-relax"

In this format, multiple dataset folders are concatenated with : into a single string; MACE will read and load the LMDB files inside each of the listed folders.
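
With the heads defined in a YAML file, training can then be launched by pointing run_train at that file. A minimal sketch, assuming the config above is saved as omat_config.yaml and that your MACE version accepts a --config argument for YAML configuration files (the file name and the remaining flags here are illustrative):

python <mace_repo_dir>/mace/cli/run_train.py \
    --config="omat_config.yaml" \
    --name="MACE_on_omat" \
    --device=cuda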