
distributed merge of per-rank Megatron data files #55

Merged · 78 commits · Aug 26, 2021

Conversation

@adammoody (Contributor) commented Aug 9, 2021

This can speed up the merge step, but it requires that the user write the final dataset to a POSIX-compliant parallel file system, like Lustre or GPFS. Each rank identifies the file offsets for its own data using collective operations, seeks to those sections, and writes its data.
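
To make the offset calculation concrete, here is a minimal sketch of the approach, assuming an mpi4py communicator; the function and variable names are illustrative, not the exact code in this PR:

    # Minimal sketch of the parallel merge, assuming mpi4py.
    from mpi4py import MPI
    import os

    def parallel_merge(rank_file, merged_file, comm=MPI.COMM_WORLD):
        nbytes = os.stat(rank_file).st_size

        # Exclusive prefix sum of byte counts gives each rank its starting
        # offset within the shared output file (exscan returns None on rank 0).
        offset = comm.exscan(nbytes, op=MPI.SUM) or 0

        # Rank 0 creates the shared output file, then everyone writes its slice.
        if comm.rank == 0:
            open(merged_file, 'wb').close()
        comm.barrier()

        with open(merged_file, 'r+b') as fout, open(rank_file, 'rb') as fin:
            fout.seek(offset)           # jump to this rank's section
            fout.write(fin.read())      # copy this rank's data into place

        comm.barrier()                  # all ranks finish before the file is used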

This adds a --merge option to preprocess_data_dist.py, which can be set to any of {parallel, serial, both}. It defaults to parallel, but one can fall back to the algorithm where rank 0 merges all files sequentially with --merge serial. A serial merge might be helpful for users where the parallel merge does not work due to the lack of a POSIX-compliant parallel file system. The both option is useful for testing purposes: it merges the rank files with both the parallel and serial methods so that the resulting files can be compared with something like cmp.

An optional --scratch option can be used to store intermediate per-rank files in storage local to the compute node, like /dev/shm, which avoids creating those files on the shared file system and offers faster write/read performance, e.g.,

--scratch /dev/shm

TODO:

  • add support for torch.distributed
  • avoid deadlock in case some process throws an exception
  • test corner cases, e.g., rank 0 contributes 0 items
  • double check why version "byte" seems to use 2 bytes -- resolved, <B is encoded as a single byte as expected

Scaling tests:
In running tests to check encoding rates at different node counts, I also get the merge time at the end. The script actually does both a parallel merge and a serial merge so that I can compare their contents. That also provides an easy way to gather times for both. The parallel merge can optionally write the per-rank files to a scratch directory with --scratch, like /dev/shm, which removes load from the parallel file system.

Each rank writes its own file, and I'm running 40 ranks per node. Times here are in seconds. Test results can vary based on how busy the (shared) file system is at the time. I've only taken one sample here.

nodes:       8     16    32    64
serial:    617    499   718     -
parallel:   16.1   15.6  17.0  26.8
/dev/shm:    -     14.8   -    22.4

The final merged file is the same size in all cases (529GB), but as the number of ranks increases, the script generates more per-rank files with each one being smaller. The total data being processed is the same, but the file counts can vary. Ideally, if things are bandwidth bound, you'd expect a constant time across each row.

Anyway, the main takeaway is that there is a nice boost using the parallel merge.

The scratch times aren't showing much improvement over writing the per-rank files to the parallel file system. My guess is that the OS has cached the per-rank file in page cache, so it's reading back from memory even when the per-rank file is written to the parallel file system. There might still be some impact on the cost to create and delete those files, but I'm not recording that.

@adammoody (Contributor Author)

@thomasw21 , I've refreshed this now that the other PR has been merged. This should be usable via mpi4py (at least it works for me). The torch equivalent still needs to be written.

This adds some MPI code in megatron/data/indexed_dataset.py, since that defines the file format(s) and implements the sequential merge. To do this cleanly, we'll probably want to define some sort of dist module that can be imported from both tools/preprocess_dataset_mpi.py and megatron/data/indexed_dataset.py.

I'll add an option to let one toggle between the parallel and sequential merge implementations. When using MPI, it currently does both. Having both in one run was handy so I could run cmp on the resulting files to verify file integrity.

@thomasw21 (Member) left a comment

Sorry I haven't finished yet, but since I'm going away on holidays today, you can have at least a partial review. Overall looks very promising, especially since we noticed that merging can take a long time and we were only running on a single cpu. What I would recommend:

  • Can you make an MPI-less implementation? I think MPI can help, but essentially this would apply not only in a multi-node setting but also in a multiprocess setting. This would allow the two other scripts to leverage that parallel merge.
  • If you plan to do it with mpi4py, I think you should implement this with torch.distributed also.
  • I never really asked before, but maybe we can benchmark torch.distributed vs mpi4py. The reason is that we're essentially doing the same thing in both, and we already install torch. So if mpi4py doesn't bring a substantial improvement, let's remove it.
  • Can you add comments, especially where you move the position on the file object? I think it's much clearer if I can easily see where we are, i.e., "at the end of the size section" or something.

Otherwise great work like always! I'll be back on monday to review the rest of the PR!

@@ -417,21 +417,21 @@ def __init__(self, path, skip_warmup=False):
offset = stream.tell()

if not skip_warmup:
print_rank_0(" warming up index mmap file...")
# print_rank_0(" warming up index mmap file...")
Member

Keep them?

Contributor Author

Yes, I agree that it'd be useful to have these. When running with lots of procs, these status messages clutter up the output quite a bit, and they print in a random order since each process prints them independently.

It'd be good to have a way to silence them. For now, I commented them out for testing.

Member

Doesn't print_rank_0 force only one process to print? After reading the code, this seems to work only with torch.distributed. I'd be in favor of dropping mpi4py if it doesn't bring a substantial performance improvement.


# To create the binary files given a set of per-rank binary
# files, one simply concatenates the data from the per-rank
# binary files in rank order. We stat each rank file to determine
Member

I remember you saying something about stat taking some time before being updated correctly? Am I hallucinating, or do we potentially have a race condition?

Contributor Author

It depends on the backing file system being used and also how the procs access the files. On file systems like Lustre and GPFS, the stat info will be up-to-date and consistent to all procs on all nodes.

On a file system like NFS, procs on one node may see a different file size than procs on another node due to client-side caching of file metadata. However, even on NFS, the stat info should be up-to-date and correct from the process where the file was written, and the algorithm used here satisfies that.

Having said all of that, writing any type of shared file in NFS is risky (not well-supported). I think the stance we should take is to say this is perfectly safe for Lustre/GPFS. It may or may not work on NFS -- users can try, but they do so at their own risk.

Member

Hmm, I'm not familiar with these notions; I'm guessing JZ is Lustre or GPFS. After reading the docs, it seems to be the case:

Spectrum Scale system of parallel files (ex-GPFS)
Parallel storage device with SSD disks (GridScaler GS18K SSD) with a capacity of 1 PB

If so, can you add something at the top of the file in order to document that? And maybe someone with more expertise on this can take a look. I guess I'll run the code a few times and check that it matches the vanilla version.

index = MMapIndexedDataset.Index(infile)
sizes = index.sizes
if rank == 0:
docs = index.doc_idx
Member

Careful: this has a strong assumption that rank 0 has a bin file offset of 0. Though it's guaranteed currently, make sure to document it somewhere.

Comment on lines 708 to 712
f.write(MMapIndexedDataset.Index._HDR_MAGIC)
f.write(struct.pack('<Q', 1))
f.write(struct.pack('<B', code(dtype)))
f.write(struct.pack('<Q', size_count))
f.write(struct.pack('<Q', docs_count))
Member

I think here we want to abstract this to be generic to the other dataset index. Essentially we have to create a new class, something along the lines of a partial writer (see the sketch after this list). It needs to support the following:

  • initialize() -> run only on rank 0, where it writes all of the header
  • seek(offset) -> essentially you know where you need to add the list
  • write() -> the current write function
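
For illustration, a rough sketch of what such a partial-writer class could look like, assuming the 34-byte mmap index header discussed later in this thread; the class and method names are hypothetical, not the code that was actually added:

    import struct
    import numpy as np

    class PartialIndexWriter:
        """Hypothetical sketch: rank 0 writes the header, every rank writes its own slice."""
        HEADER_LEN = 34  # magic (9) + version (8) + dtype code (1) + two counts (16)

        def __init__(self, path, rank):
            self.path = path
            self.rank = rank

        def initialize(self, magic, dtype_code, size_count, docs_count):
            # Called on rank 0 only, before any rank calls write_sizes().
            with open(self.path, 'wb') as f:
                f.write(magic)                          # header magic bytes
                f.write(struct.pack('<Q', 1))           # format version
                f.write(struct.pack('<B', dtype_code))  # element dtype code
                f.write(struct.pack('<Q', size_count))  # total number of size entries
                f.write(struct.pack('<Q', docs_count))  # total number of doc index entries

        def write_sizes(self, sizes, elem_offset):
            # Seek past the header plus the int32 entries owned by lower ranks,
            # then write this rank's contribution in place.
            with open(self.path, 'r+b') as f:
                f.seek(self.HEADER_LEN + elem_offset * np.int32().itemsize)
                f.write(np.asarray(sizes, dtype=np.int32).tobytes())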


# The list of size values from each rank are
# concatenated and stored as int32.
f.seek(pos + size_offset * np.int32().itemsize)
Member

Can we abstract this into write? Essentially write needs 4 things:

  • sizes
  • docs
  • offset_sizes
  • offset_counts

@adammoody adammoody changed the title MPI-based parallel merge of per-rank Megatron data files WIP: MPI-based parallel merge of per-rank Megatron data files Aug 12, 2021
@adammoody (Contributor Author)

> Otherwise great work like always! I'll be back on monday to review the rest of the PR!
>
> Sorry I haven't finished yet, but since I'm going away on holidays today, you can have at least a partial review. Overall looks very promising, especially since we noticed that merging can take a long time and we were only running on a single cpu. What I would recommend:
>
> • Can you make an MPI-less implementation? I think MPI can help, but essentially this would apply not only in a multi-node setting but also in a multiprocess setting. This would allow the two other scripts to leverage that parallel merge.
>
> • If you plan to do it with mpi4py, I think you should implement this with torch.distributed also.

I'll check on that. I think a torch.distributed version should be straightforward based on what I have now. That's already on the TODO list. I suspect we could create something for Python's multiprocessing Pool, too. That will take more work, though, since there are assumptions in the existing algorithm that will need to change, namely that each process writes and reads exactly one "part" file. We'd need to come up with equivalent collective ops, too. My guess is that it's likely possible, but I have not considered it yet.

> • I never really asked before, but maybe we can benchmark torch.distributed vs mpi4py. The reason is that we're essentially doing the same thing in both, and we already install torch. So if mpi4py doesn't bring a substantial improvement, let's remove it.

Right, I suspect performance should be similar, especially because there is not a lot of communication in these implementations. In my case, the MPI version is handy because it's easier and less error-prone to launch MPI jobs on my system. I also know that interface a bit better, and it offers a wider set of operations, so it's quicker to develop and test. I'm then circling back to add torch.distributed once I have a working algorithm. So far, I think we can do everything in torch.distributed that we need, i.e., the right primitives are there or they can be emulated (like scan).
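
For example, a minimal sketch of emulating an exclusive scan (MPI_Exscan) with torch.distributed, assuming the process group is already initialized; this is illustrative, not the exact code in the PR:

    import torch
    import torch.distributed as dist

    def exscan_int64(value):
        """Sum of `value` over all lower-ranked processes (0 on rank 0)."""
        world_size = dist.get_world_size()
        rank = dist.get_rank()

        # Gather every rank's contribution, then add up the ones below us.
        mine = torch.tensor([value], dtype=torch.int64)
        everyone = [torch.zeros(1, dtype=torch.int64) for _ in range(world_size)]
        dist.all_gather(everyone, mine)
        return int(sum(t.item() for t in everyone[:rank]))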

> • Can you add comments, especially where you move the position on the file object? I think it's much clearer if I can easily see where we are, i.e., "at the end of the size section" or something.

> Otherwise great work like always! I'll be back on monday to review the rest of the PR!

Thanks for taking a look already @thomasw21 . Enjoy your holidays!

@adammoody (Contributor Author)

adammoody commented Aug 14, 2021

The functions I have in there right now are somewhat specific to the needs of the preprocess_dataset_mpi.py script, which assumes that each rank has one file. That was my first step, because it was the most natural step based on what I needed.

It should be straightforward to extend this to provide a new "merge files" function that takes a list of part files and an object defining the distributed environment. Because the final merged file has to be created specially when multiple ranks will be writing to it, some of this would likely need to go into the class constructor. I'm thinking we could add that function as a method to the dataset class. So, for example, the main logic in merge_preprocessed_data.py might then change from:

    builder = indexed_dataset.make_builder(output_bin_file,
                                           impl=dataset_impl,
                                           dtype=dtype)
    for dataset in args.datasets:
        builder.merge_file_(dataset)
    builder.finalize(output_idx_file)

to:

    distctx = DataDist()
    builder = indexed_dataset.make_builder(output_bin_file,
                                           impl=dataset_impl,
                                           dtype=dtype,
                                           distctx=distctx)
    builder.merge_files_(args.datasets)
    builder.finalize(output_idx_file)

The DataDist object is a class that defines the distributed environment (mpi4py or torch.distributed), and it abstracts the necessary collective implementations. This is passed to the indexed_dataset.make_builder function, which then enables output_bin_file to be created in a collective manner.

Then we drop the for loop and just pass the full list of files to the merge function. The ranks in distctx can then divide up that list and write the file using algorithms very close to what we already have.
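
For context, a rough sketch of the kind of distributed-context wrapper being described, assuming torch.distributed; the real DistData class in this PR has more methods (e.g. scatterv_), and the bodies here are illustrative:

    import torch
    import torch.distributed as dist

    class DistData(object):
        """Sketch of a thin wrapper over torch.distributed used by the preprocessing tools."""

        def __init__(self, backend='gloo'):
            dist.init_process_group(backend, init_method='env://')
            self.rank = dist.get_rank()
            self.numranks = dist.get_world_size()

        def barrier(self):
            dist.barrier()

        def all_sum(self, val):
            # Global sum of a scalar; also acts as a barrier.
            t = torch.tensor([val], dtype=torch.int64)
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            return int(t.item())

        def allraise_if(self, err):
            # If any rank hit an exception, make every rank raise so no one deadlocks.
            if self.all_sum(1 if err is not None else 0) > 0:
                if err is not None:
                    raise err
                raise RuntimeError("another rank hit an error")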

@thomasw21 (Member) left a comment

Finally finished the review, I know it's still WIP, but since I hadn't looked at the entire PR, I re-reviewed it. This looks incredible, in particular the performance gain we seem to obtain from this:

  • I'd advocate for the removal of mpi4py.
  • Please factorize some code; I'm sure there's a good abstraction allowing us to re-use some of the existing code.
  • If the multiprocessing implementation becomes tricky, feel free to put it away for now. We can always come back to it.
  • Can you share the scripts you're running to compare them? I'll try to run them this week in order to obtain a benchmark on our setup.

from mpi4py import MPI
self.MPI = MPI
except:
#print(f"ERROR: mpi4py requested, but failed to import, falling back to torch.distributed.", flush=True)
Member

I'd be in favor of raising an exception and stopping the script, but really this is a personal opinion.


class DistData(object):
def __init__(self, backend='gloo', use_mpi4py=False):
self.MPI = None
Member

This is confusing as torch.distributed can use MPI as its backend.

Contributor Author

I renamed that variable to be clear

import torch
import torch.distributed as dist

class DistData(object):
Member

So that's an awesome abstraction for multinode, especially in the case where we might want to support both torch.distributed and mpi4py. Though as I'm thinking about it, supporting mpi4py will become costly very soon and might not bring much to the table, as we might need to implement all improvements in two frameworks that can essentially use the same MPI backend. As you mentioned in a comment, we should be able to do everything using torch.distributed, so let's remove mpi4py and come back to it the day torch.distributed isn't enough.

Sorry for my mistake, I thought it would be interesting to support both cases, but I feel it ends up being a burden here, where we need to create a higher-level abstraction for not much gain.

Contributor Author

Yeah, it would require some maintenance. I can yank out the MPI just before we merge the PR.

import torch.distributed as dist

class DistData(object):
def __init__(self, backend='gloo', use_mpi4py=False):
Member

Do you know what the difference is in terms of performance between gloo and mpi?

Contributor Author

I haven't measured this. There is not a ton of communication going on in this script, so it likely doesn't make much difference, especially since this will be bottlenecked by I/O performance.

Under the "Which backend to use?" section, they generally recommends that people stick with gloo, unless they really want mpi.
https://pytorch.org/docs/stable/distributed.html

Member

Ah I did not know we had to build pytorch from source in order to get mpi working ...

Comment on lines 476 to 477
args.dist_merge = True
if args.dist_merge:
Member

That's odd ^^

Contributor Author

That was a placeholder for a new --merge option. Just added that.

for key in args.columns:
filemain = get_filename(args, key)
filerank = get_filename(args, key, args.rank)
gather_files_dist(filemain, [filerank], args.distctx, dtype=best_fitting_dtype(args.vocab_size))
Member

Use merge_files_dist in order to have all the checks.

Contributor Author

Good eye that I'm not using merge_files_dist here. It's used by the merge_preprocessed_data.py script, which I've updated to use the distributed merge. I can add that in a future PR. If it helps, I can move that function to the other PR.

There is a subtle difference between merge_files_dist and gather_files_dist that I haven't explained well in the comments.

In merge_files_dist, all ranks must provide the full list of files to be merged, and it is assumed that any rank can read any of those files.

In the gather case, each process provides a subset of files that it will contribute. It also may be that only the calling process can access that subset of files. The gather call allows one to optionally store the per-rank file on node-local storage like /dev/shm or a node-local SSD, which might be only accessible to the calling process.

Comment on lines 491 to 496
# rename files for now and also do regular merge so we can time both and "cmp" them
if args.rank == 0:
binfile = data_file_path(filemain)
idxfile = index_file_path(filemain)
os.rename(binfile, binfile + ".par")
os.rename(idxfile, idxfile + ".par")
Member

Okay maybe you can add a TODO so I won't miss it before merge.

Contributor Author

Actually, this is useful now with the --merge=both option, so I've updated it for that.

os.rename(binfile, binfile + ".par")
os.rename(idxfile, idxfile + ".par")

args.distctx.barrier()
Member

all_sum already acts as a barrier, no?

Contributor Author

Yeah, that's true. I'll update the comment to note that it's also acting as a barrier (in case someone later decides to drop the timing/report bit).

startup_end = time.time()
if args.rank == 0:
print("Seconds to startup:", startup_end - startup_start)
print(f"Seconds to startup: {startup_end - startup_start}")
Member

🥳

@adammoody (Contributor Author)

adammoody commented Aug 16, 2021

Thanks for the re-review, @thomasw21 . I'll try to get back to make these changes in the next day or so.

In the meantime, I think it should work for you if you want to try it on multiple nodes. I did also see that JZ has GPFS for several of its directories, so I think it should be fine. At least it seems that HOME, WORK, and SCRATCH are GPFS:

http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-calculateurs-disques-eng.html

This PR currently writes out two sets of files. It runs the parallel merge, renames those files to a .par extension, and then it runs the serial merge where rank 0 does all of the work using merge_file_. For an output file name "output", you'll get something like:

output.bin.par
output.idx.par
output.bin
output.idx

The cost for each merge is measured and reported separately so you can get the timing for each from one run. I then just use cmp to check that the corresponding files are identical:

cmp output.bin.par output.bin
cmp output.idx.par output.idx

This is all for testing. Longer term, we could perhaps just settle on the parallel merge. However, there might be a use case for keeping the serial merge in there. For example, that might be a decent workaround for someone who does not have a file system like Lustre/GPFS. If so, we could add a --merge option so that one could choose between distributed or serial (or even allow --merge=both for testing).

BTW, the cmp does currently note a difference for the .idx files when using --dataset-impl cached, which is related to #66

@adammoody (Contributor Author)

Oh, and to run, I've been doing something like:

python -m torch.distributed.launch --nproc_per_node 2 ${topdir}/tools/preprocess_dataset_mpi.py \
       --input openwebtext \
       --count 10 \
       --output-prefix openwebtext-bert \
       --vocab bert-large-uncased-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

That --count 10 limits the number of samples to 10. That's helpful for testing. You can increase the number or just drop the --count option to do the full dataset.

I know you'd like to drop the mpi4py, but just in case you want to try that anyway, something like the following should work:

srun -n 80 -N 2 python3 ${topdir}/tools/preprocess_dataset_mpi.py \
       --input openwebtext \
       --count 10 \
       --mpi4py \
       --output-prefix openwebtext-bert \
       --vocab bert-large-uncased-vocab.txt \
       --dataset-impl mmap \
       --tokenizer-type BertWordPieceLowerCase \
       --split-sentences

The main difference here is the addition of --mpi4py and using srun to launch.

You can toggle between --dataset-impl mmap and --dataset-impl cached to try each of the two file formats.

@adammoody (Contributor Author)

adammoody commented Aug 16, 2021

@thomasw21 , heads up that I just pushed a commit that changes the usage to add a --merge={parallel,serial,both} option. It defaults to parallel.

And I merged in main to pick up the cached dataset .idx fix.

@thomasw21 (Member)

Let me know when it's up for review, I'll try spending some time on it later in the week.

@thomasw21 (Member)

thomasw21 commented Aug 23, 2021

Quick update: I'm currently trying to run your multi node version, and I seem to have some issues:

Code

In test_preprocess_openwebtext_script.sh:

export MASTER_ADDR=$SLURM_SUBMIT_HOST # "$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)"
export MASTER_PORT=12345
export DISTRIBUTED_ARGS=" \
        --nproc_per_node 40 \
        --nnodes $SLURM_JOB_NUM_NODES \
        --node_rank $SLURM_NODEID \
        --master_addr $MASTER_ADDR \
        --master_port $MASTER_PORT \
"

PREPROCESSED_DATASET_PREFIX=$SCRATCH/data/processed/openwebtext-gpt2
export PROCESSING_ARGS=" \
       --input stas/openwebtext-10k \
       --count 10000 \
       --output-prefix $PREPROCESSED_DATASET_PREFIX \
       --merge both \
       --dataset-impl mmap \
       --tokenizer-type PretrainedFromHF \
       --tokenizer-name-or-path gpt2 \
       --seed 101
"

time python -m torch.distributed.launch $DISTRIBUTED_ARGS  Megatron-DeepSpeed-Adam/tools/preprocess_data_dist.py $PROCESSING_ARGS

and then call this via srun sh ./test_preprocess_openwebtext_script.sh.

Issues

I'm getting the following when running on 4/8 nodes (works correctly for 1/2 nodes):

ValueError: ProcessGroupGloo::scatter: invalid tensor size at index 80 (expected (32), got (31))

I'm not sure why this happens. My guess is that, for some reason, a process is expecting something of size 32 and got 31. This sounds a lot like the remainder of indices. The reason things would work for 1/2 nodes is that 10k / (2 * 40) = 125 (with 2 nodes of 40 cpus), whereas 10k / (4 * 40) = 62.5 and 10k / (8 * 40) = 31.25 (which corresponds to the error we get up there). Now the question is whether gloo actually handles tensors of different sizes in scatter. If not, we might need to pad values and ignore them... it's a shame everything worked correctly using mpi4py and not torch.distributed...

Maybe I'm missing something? I'll look more in depth tomorrow. I might have gotten the script very wrong, as I'm not used to using torch.distributed.

@adammoody (Contributor Author)

Oh, shoot. Yes, I can reproduce that. Apparently, I hadn't tested that configuration. I guess it's a true scatter rather than scatterv in torch.distributed. I'll work on a fix.

@adammoody (Contributor Author)

@thomasw21 , I just pushed a commit that seems to fix it for me. Would you please try again when you get a chance?

# Receive a tensor of the max count size from the root,
# then copy values into output numpy array, which may be smaller.
recvtensor = torch.zeros(max(counts), dtype=torch.int64)
dist.scatter(recvtensor, scatterlist, src=root)
@thomasw21 (Member) Aug 23, 2021

What about scatter_object_list? I tried it and it seems to work nicely, though I don't know what the cost of using that method vs padding is.

Contributor Author

Oh, interesting. So I seem to be using an old enough torch.distributed that I don't have scatter_object_list:

AttributeError: module 'torch.distributed' has no attribute 'scatter_object_list'

It looks like it was added about 9 months ago in this commit:

pytorch/pytorch@02d89f9

I can also see that it internally is just calling broadcast and scatter a couple of times.

After seeing that, I think our pad method is probably the best way to go after all.

Member

yeah let's go with padding.

Comment on lines 83 to 86
for num, s in zip(counts, slices):
padtensor = torch.zeros(max(counts), dtype=torch.int64)
padtensor[:num] = s
scatterlist.append(padtensor)
Member

If scatter_object_list doesn't work for you, maybe we can improve slightly.

import torch.nn.functional as F

slices = torch.split(torch.from_numpy(invals), counts)
max_size = max(counts)
scatterlist = [F.pad(slice, (0, max_size - len(slice))) for slice in slices]

Contributor Author

Yep, that's cleaner. Thanks for the tip.

I did that just to try it and pushed a commit in case you want to use it. I can also try scatter_object_list. I suspect that should also work.

I suppose the tensor-based method could be more efficient communication-wise (no pickle step), though this scatter step will not take much time in either case compared to the total time of the script.

The bigger concern might be the memory required on rank 0 to put together the arguments for the scatter. With the mpi4py Scatterv, I know mpi4py sends data from the original numpy array. With torch, it looks like we'll at least be doubling the memory by effectively slicing up the original numpy list into these per-rank tensors/sublists. I don't know if it would be worse than doubling the memory -- it depends on the implementation under the covers. However, even that likely won't be an issue until the input index list is really big, at which point we can always fall back to the file-based scatter.

Whichever you prefer is good with me.

Member

I think the day we can't fit everything on a single process, we'll think of a better way IMO (perhaps bring back mpi4py). Let's stick to the padding strategy.

# then copy values into output numpy array, which may be smaller.
recvtensor = torch.zeros(max(counts), dtype=torch.int64)
dist.scatter(recvtensor, scatterlist, src=root)
outval[:] = recvtensor[:counts[self.rank]]
@thomasw21 (Member) Aug 24, 2021

If we don't benefit from the in-place operator, let's return the tensor instead of doing an in-place operation? Typically you could return only recvtensor[:counts[self.rank]], which would remove the need to initialize outval.

Contributor Author

Good suggestion. Made that change.

@thomasw21 (Member) left a comment

Overall that looks good to me. However when running:

  • idx sizes seem to differ, probably an offset that's wrong for some reason.
> cmp -b openwebtext-gpt2_text_document.idx openwebtext-gpt2_text_document.idx.par 
openwebtext-gpt2_text_document.idx openwebtext-gpt2_text_document.idx.par differ: byte 30115, line 118 is  12 ^J   0 ^@

You seem to have already integrated your fix in this branch; I suspect something about a wrong offset.

  • I see no performance improvement using 4 nodes; in fact, I see a significant loss in performance.

config                 nodes  cpus_per_node  serial                     parallel
stas/openwebtext-10k   1      40             0.0687 s (314.938 MB/s)    0.1867 s (115.864 MB/s)
stas/openwebtext-10k   4      40             0.5817 s (37.204 MB/s)     0.6287 s (34.424 MB/s)

I see potentially two issues:

  • stas/openwebtext-10k might just be too small. Going to try on openwebtext.
  • That might be linked to our setup, where we're not using SSDs. Though that wouldn't really explain the massive slowdown when increasing the number of nodes. Will investigate.

def scatterv_(self, invals, counts, outval, root=0):
"""Scatter int64 values from invals according to counts array, receive values in outval"""

self.allassert(len(counts) == self.numranks,
Member

Let's put that in a future PR. For me, this PR introduces the feature; we can always improve on it later.

@adammoody (Contributor Author)

@thomasw21 , thanks for running those tests.

Regarding the index cmp check, are you writing files in mmap or cached format?

How many items are you selecting from openwebtext? I'm trying to get a handle on what section of the file that might be.

Are the output files being written to one of the GPFS directories?

@thomasw21 (Member)

mmap, all items, and directory should be safe.

@adammoody (Contributor Author)

I'm a bit stumped on the cmp mismatch. In the mmap index file format, there is a 34 byte header. Using all samples from stas/openwebtext-10k, byte offset 30115 should then fall within the list of 32-bit size values. I was hoping that might line up close to a rank boundary, but I'm not seeing anything yet in the various run configurations that I've tried.

In that case, do you know how many total ranks you were using?

@thomasw21 (Member)

Sorry for the short reply earlier, I was away from my computer. I re-ran it, and it seems to be fine... so I don't know what was wrong. Let's ignore it for now, as it was likely an error on my end.

However, I am seeing a drastic slowdown currently (I re-ran stas/openwebtext and it reproduced the numbers I mentioned earlier). Still waiting to test on openwebtext (I'm having issues with a version, currently getting datasets.utils.info_utils.NonMatchingSplitsSizesError).

FYI, I'll go on holidays starting Friday, so I might merge this, seeing as I'm confident it works (just not confident it runs any faster on our setup). I'll benchmark everything upon coming back or let @stas00 try to run the scripts.

@adammoody (Contributor Author)

Ok, good. I was focused on finding the cause of the correctness problem first, but it sounds like that might have been a false alarm. Please let me know if that shows up again.

Regarding performance, yes, the numbers you are seeing are quite bad and scaling poorly.

If I try the same configuration with stas/openwebtext-10k, with 40 procs per node using 1 node and 4 nodes, I'm getting something close, though my 1-node results seem to be flipped so that parallel is faster there.

nodes/ppn Serial Merge Parallel Merge
1/40 106.266 MB/s 237.300 MB/s
4/40 18.279 MB/s 14.374 MB/s

On file systems that allow parallel writes to a file, the cost can be high if processes write to sections that are too close to each other. The file system servers often implement some form of byte-range locking to ensure that only one client is writing to a given region at a time, and I've seen cases where it's easy to thrash those lock servers, especially in false-sharing scenarios. Since this dataset is relatively small, the processes will be slicing up the file pretty finely.
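
(As a rough illustration: the 1-node times and bandwidths above imply the merged stas/openwebtext-10k output is only about 20 MB, so with 4 nodes x 40 ranks each process writes on the order of 100 KB, and neighboring ranks can easily fall within the same lock ranges.)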

How do things fare for a larger dataset?

As a next step up, maybe try openwebtext with a --count 100000?

@adammoody (Contributor Author)

adammoody commented Aug 26, 2021

Also, heads up that things can seem to get stuck after this message for a while:

Waiting for ranks to finalize files ...

Rank 0 prints this message when it finishes its portion of the work, and then it waits for everyone in a barrier. Rank 0 can finish much earlier than the full set of procs due to load imbalance across the processes. With that barrier, everyone has to wait on the slowest process, and that's probably the one that ended up with the most text to process. I have seen a factor-of-2 spread in much of my testing. Things are better if the data is shuffled.

I have some ideas to improve load balance in future work if that becomes an issue.

@thomasw21 (Member)

thomasw21 commented Aug 26, 2021

Can't seem to benchmark on openwebtext. I'll create an issue directly on datasets (huggingface/datasets#2839).

datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, dataset_name='openwebtext'), 'recorded': SplitInfo(name='train', num_bytes=39611023912, num_examples=7982430, dataset_name='openwebtext')}]

@thomasw21 (Member)

thomasw21 commented Aug 26, 2021

Okay let's merge this, it's safe to say that it works, and doesn't break other scripts. This will allow us to look at the jsonl version of that script. A few things I think we might want to do:

  • use torch instead of numpy (I'm pretty sure everything we want to do can be done via torch).
  • update the README to describe the option of using this distributed method, probably for very large datasets.
  • improve logging to have live logs. Typically, on large files, having no ongoing logging might suggest deadlocks and such.
  • though it was a great thing to have the "both" option, I think it makes little sense since we never want to compute "both" except in testing, and both can be computed by running the script twice with the two different options.
  • improve logging. When running, it sometimes crashed (due to an incorrect setup on my end), but reading the stack trace to understand what happened was nightmare-ish because all processes were logging to the same place. I don't know if there's a way for stack traces to be written in blocks. Maybe a better setup would be for each process to write to separate files...
  • We should expose subsets for datasets. Typically you can't really run load_dataset("oscar"); you have to choose a subset, e.g. load_dataset("oscar", "en").
  • To be explored, but usually users will use datasets to modify the original dataset via map/filter. It might be worth starting to look into creating a preprocess_meg_dist(datasets) method or something like that (let's discuss this first).
  • The notion of a parallel merge could be extended to the other scripts. Essentially the other scripts are single-node versions using multiprocessing. I'm pretty sure we can get a similar parallel merge. Maybe DistData could be a good abstraction for multiprocessing also.

I'll create issues when I come back, but if you're interested in any of them or want to challenge any ideas, feel free to ping me!
I'll still run benchmarks when I come back if openwebtext gets fixed this week.

and also .... AWESOME WORK!!! Thanks again for the massive contribution! If you have any interest in bigscience, feel free to join the slack. cc @stas00

@thomasw21 thomasw21 merged commit 9722111 into bigscience-workshop:main Aug 26, 2021
@stas00 (Contributor)

stas00 commented Aug 26, 2021

The test suite can't handle this addition. Why was offline mode enabled? How can it get the required files then?

and the dataset argument is incorrect; it should be "--input stas/openwebtext-10k"

tests/test_preprocessing.py::MegDSTestPreprocessing.test_preprocess_data ✓                                                                        25% ██▌       
Running:  python -m torch.distributed.launch --nproc_per_node 2 /home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py --input openwebtext-10k --count 1000 --output-prefix /tmp/tmpv0p2agsk/test-ds-meg-gpt2-openwebtext_1k --dataset-impl mmap --tokenizer-type GPT2BPETokenizer --merge-file /home/ubuntu/code/Megatron-DeepSpeed/tests/data/gpt2/gpt2-tiny-merges.txt --vocab /home/ubuntu/code/Megatron-DeepSpeed/tests/data/gpt2/gpt2-tiny-vocab.json --append-eod
stderr: /home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
stderr:   logger.warn(
stderr: The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
stderr: WARNING:torch.distributed.run:--use_env is deprecated and will be removed in future releases.
stderr:  Please read local_rank from `os.environ('LOCAL_RANK')` instead.
stderr: INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
stderr:   entrypoint       : /home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py
stderr:   min_nodes        : 1
stderr:   max_nodes        : 1
stderr:   nproc_per_node   : 2
stderr:   run_id           : none
stderr:   rdzv_backend     : static
stderr:   rdzv_endpoint    : 127.0.0.1:29500
stderr:   rdzv_configs     : {'rank': 0, 'timeout': 900}
stderr:   max_restarts     : 3
stderr:   monitor_interval : 5
stderr:   log_dir          : None
stderr:   metrics_cfg      : {}
stderr: 
stderr: INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
stderr: /home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
stderr:   warnings.warn(
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
stderr:   restart_count=0
stderr:   master_addr=127.0.0.1
stderr:   master_port=29500
stderr:   group_rank=0
stderr:   group_world_size=1
stderr:   local_ranks=[0, 1]
stderr:   role_ranks=[0, 1]
stderr:   global_ranks=[0, 1]
stderr:   role_world_sizes=[2, 2]
stderr:   global_world_sizes=[2, 2]
stderr: 
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_0/0/error.json
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_0/1/error.json
stdout: 2021-08-26T21:20:50: Opening dataset openwebtext-10k
stdout: ERROR: Cannot download 'openwebtext-10k' since running in offline mode.
stdout: ERROR: If the dataset is large, it may be more efficient to download with a single process:
stdout: ERROR:     from datasets import load_dataset
stdout: ERROR:     dset = load_dataset('openwebtext-10k')
stdout: ERROR: Alternatively, one can force this script to download by setting $HF_DATASETS_OFFLINE=0
stderr: Traceback (most recent call last):
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 639, in <module>
stderr:     main()
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 589, in main
stderr:     dset = load_dset(args)
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 266, in load_dset
stderr:     args.distctx.allraise_if(err)
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/megatron/data/distdata.py", line 53, in allraise_if
stderr:     raise DistDataError
stderr: megatron.data.distdata.DistDataError
stderr: Traceback (most recent call last):
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 639, in <module>
stderr:     main()
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 589, in main
stderr:     dset = load_dset(args)
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 266, in load_dset
stderr:     args.distctx.allraise_if(err)
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/megatron/data/distdata.py", line 48, in allraise_if
stderr:     raise err
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 253, in load_dset
stderr:     dset = load_dataset(dsetname, split=args.split, keep_in_memory=None)
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/load.py", line 819, in load_dataset
stderr:     builder_instance = load_dataset_builder(
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/load.py", line 681, in load_dataset_builder
stderr:     module_path, hash, resolved_file_path = prepare_module(
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/load.py", line 330, in prepare_module
stderr:     local_path = cached_path(file_path, download_config=download_config)
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 288, in cached_path
stderr:     output_path = get_from_cache(
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 604, in get_from_cache
stderr:     _raise_if_offline_mode_is_enabled(f"Tried to reach {url}")
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/datasets/utils/file_utils.py", line 362, in _raise_if_offline_mode_is_enabled
stderr:     raise OfflineModeIsEnabled(
stderr: datasets.utils.file_utils.OfflineModeIsEnabled: Offline mode is enabled. Tried to reach https://raw.githubusercontent.com/huggingface/datasets/1.11.0/datasets/openwebtext-10k/openwebtext-10k.py
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1932) of binary: /usr/bin/python
stderr: ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
stderr:   restart_count=1
stderr:   master_addr=127.0.0.1
stderr:   master_port=29500
stderr:   group_rank=0
stderr:   group_world_size=1
stderr:   local_ranks=[0, 1]
stderr:   role_ranks=[0, 1]
stderr:   global_ranks=[0, 1]
stderr:   role_world_sizes=[2, 2]
stderr:   global_world_sizes=[2, 2]
stderr: 
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_1/0/error.json
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_1/1/error.json
stderr: Traceback (most recent call last):
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 639, in <module>
stderr:     main()
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 585, in main
stderr:     args = get_args()
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/tools/preprocess_data_dist.py", line 194, in get_args
stderr:     args.distctx = DistData(backend=args.torch_backend)
stderr:   File "/home/ubuntu/code/Megatron-DeepSpeed/megatron/data/distdata.py", line 16, in __init__
stderr:     dist.init_process_group(backend, init_method="env://")
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 523, in init_process_group
stderr:     default_pg = _new_process_group_helper(
stderr:   File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 616, in _new_process_group_helper
stderr:     pg = ProcessGroupGloo(
stderr: RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [172.31.30.41]:32170: Connection refused
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 1959) of binary: /usr/bin/python
stderr: ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 2/3 attempts left; will restart worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
stderr:   restart_count=2
stderr:   master_addr=127.0.0.1
stderr:   master_port=29500
stderr:   group_rank=0
stderr:   group_world_size=1
stderr:   local_ranks=[0, 1]
stderr:   role_ranks=[0, 1]
stderr:   global_ranks=[0, 1]
stderr:   role_world_sizes=[2, 2]
stderr:   global_world_sizes=[2, 2]
stderr:
stderr: INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_2/0/error.json
stderr: INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_t_gc0h3b/none_bfjtb6a_/attempt_2/1/error.json
gr ^C

@adammoody (Contributor Author)

Thanks for all of your help getting this into shape, @thomasw21 ! Enjoy your holidays!

@adammoody adammoody deleted the pmerge branch August 27, 2021 00:50