distributed merge of per-rank Megatron data files #55
Conversation
@thomasw21, I've refreshed this now that the other PR has been merged. This should be usable via tools/preprocess_dataset_mpi.py. This adds some MPI code in megatron/data/distdata.py. I'll add an option to let one toggle between the parallel and sequential merge implementations. When using MPI, it currently does both. Having both in one run was handy so I could compare the timings and run cmp on the outputs.
Sorry I haven't finished yet, but since I'm going away on holidays today, you can have at least a partial review. Overall this looks very promising, especially since we noticed that merging can take a long time and we were only running on a single cpu. What I would recommend:
- Can you make an mpi-less implementation? I think MPI can help, but essentially this would apply not only in a multi-node setting, but also in a multiprocess setting. This would allow the two other scripts to leverage that parallel merge.
- If you plan to do it with mpi4py, I think you should implement it with torch.distributed also.
- I never really asked before, but maybe we can benchmark torch.distributed vs mpi4py. The reason is that we're essentially doing the same thing in both, and we already install torch. So if mpi4py doesn't bring a substantial improvement, let's remove it.
- Can you add comments, especially where you move the position on the FileObject? I think it's much clearer if I can easily see where we are, i.e. "at the end of the size part" or something.
Otherwise great work as always! I'll be back on Monday to review the rest of the PR!
megatron/data/indexed_dataset.py
Outdated
@@ -417,21 +417,21 @@ def __init__(self, path, skip_warmup=False):
    offset = stream.tell()

    if not skip_warmup:
        print_rank_0("    warming up index mmap file...")
        # print_rank_0("    warming up index mmap file...")
Keep them?
Yes, I agree that it'd be useful to have these. When running with lots of procs, these status messages clutter up the output quite a bit, and they print in a random order since each process prints them independently.
It'd be good to have a way to silence them. For now, I commented them out for testing.
Doesn't rank_0 force only one process to print? After reading the code, this seems to work only with torch.distributed. I'd be in favor of dropping mpi4py if it doesn't bring a substantial performance improvement.
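For reference, a paraphrased sketch of how a rank-0-only print typically works in Megatron (not the exact source): it only filters when torch.distributed has been initialized, which is why a pure mpi4py launch would still print from every process.

```python
import torch.distributed as dist

def print_rank_0(message):
    # Only filter by rank when torch.distributed has actually been initialized;
    # otherwise every process falls through and prints on its own.
    if dist.is_available() and dist.is_initialized():
        if dist.get_rank() == 0:
            print(message, flush=True)
    else:
        print(message, flush=True)
```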
megatron/data/indexed_dataset.py
Outdated
# To create the binary files given a set of per-rank binary
# files, one simply concatenates the data from the per-rank
# binary files in rank order. We stat each rank file to determine
I remember you saying something about stat taking some time before being updated correctly? Am I hallucinating, or do we potentially have a race condition?
It depends on the backing file system being used and also how the procs access the files. On file systems like Lustre and GPFS, the stat info will be up-to-date and consistent to all procs on all nodes.
On a file system like NFS, procs on one node may see a different file size than procs on another node due to client-side caching of file metadata. However, even on NFS, the stat info should be up-to-date and correct from the process where the file was written, and the algorithm used here satisfies that.
Having said all of that, writing any type of shared file in NFS is risky (not well-supported). I think the stance we should take is to say this is perfectly safe for Lustre/GPFS. It may or may not work on NFS -- users can try, but they do so at their own risk.
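To make the scheme under discussion concrete, here is a minimal sketch (not the PR's actual code) of how a rank could compute its write offset from the stat'd file sizes, assuming torch.distributed is already initialized; the function name is hypothetical.

```python
import os
import torch
import torch.distributed as dist

def compute_write_offset(local_binfile):
    """Byte offset in the merged .bin file where this rank's data should land."""
    rank = dist.get_rank()
    world = dist.get_world_size()
    # Each rank stats only the file it wrote itself, so the size is reliable
    # even on NFS; on Lustre/GPFS any rank could stat any file.
    local_size = torch.tensor([os.stat(local_binfile).st_size], dtype=torch.int64)
    sizes = [torch.zeros(1, dtype=torch.int64) for _ in range(world)]
    dist.all_gather(sizes, local_size)
    # Exclusive prefix sum in rank order: bytes of all lower ranks come first.
    return int(sum(s.item() for s in sizes[:rank]))
```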
Hmm, I'm not familiar with these notions; I'm guessing JZ is Lustre or GPFS. After reading the docs, that seems to be the case: "Spectrum Scale parallel file system (ex-GPFS), parallel storage device with SSD disks (GridScaler GS18K SSD) with a capacity of 1 PB."
If so, can you add something at the top of the file to document that? And maybe someone with more expertise on this can take a look. I guess I'll run the code a few times and check that it matches the vanilla version.
megatron/data/indexed_dataset.py
Outdated
index = MMapIndexedDataset.Index(infile)
sizes = index.sizes
if rank == 0:
    docs = index.doc_idx
So be careful: this has a strong assumption that rank=0 has a bin file offset of 0. Though it's guaranteed currently, make sure to document it somewhere.
megatron/data/indexed_dataset.py
Outdated
f.write(MMapIndexedDataset.Index._HDR_MAGIC)
f.write(struct.pack('<Q', 1))
f.write(struct.pack('<B', code(dtype)))
f.write(struct.pack('<Q', size_count))
f.write(struct.pack('<Q', docs_count))
I think here we want to abstract it to be generic to the other dataset index. Essentially we have to create a new class, something along the lines of a partial write. It needs to support the following:
- initialize() -> run only on rank=0, where it adds the full header
- seek(offset) -> essentially you know where you need to add the list
- write() -> the current write function
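A rough sketch of the kind of class being suggested (the class name is made up for illustration; the header fields follow the quoted diff above):

```python
import struct
import numpy as np

class PartialIndexWriter:
    """Hypothetical helper letting several ranks fill one shared .idx file."""

    def __init__(self, path):
        self._path = path
        self._f = None

    def initialize(self, hdr_magic, dtype_code, size_count, docs_count):
        # Run only on rank 0: write the full header so the file layout is fixed
        # before other ranks start seeking into the file.
        with open(self._path, 'wb') as f:
            f.write(hdr_magic)
            f.write(struct.pack('<Q', 1))           # version
            f.write(struct.pack('<B', dtype_code))  # element dtype code
            f.write(struct.pack('<Q', size_count))  # total number of size entries
            f.write(struct.pack('<Q', docs_count))  # total number of doc_idx entries

    def seek(self, offset):
        # Every rank: position the file at the start of its own region.
        if self._f is None:
            self._f = open(self._path, 'r+b')
        self._f.seek(offset)

    def write(self, values, dtype=np.int32):
        # Current write behavior: dump this rank's values at the seeked position.
        self._f.write(np.asarray(values, dtype=dtype).tobytes(order='C'))
```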
megatron/data/indexed_dataset.py
Outdated
# The list of size values from each rank are
# concatenated and stored as int32.
f.seek(pos + size_offset * np.int32().itemsize)
Can we abstract this into write? Essentially write needs 4 things:
- sizes
- docs
- offset_sizes
- offset_counts
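Folding the seek into write, as suggested here, might look something like this (a sketch only; sizes_start and docs_start are hypothetical markers for where the size and doc_idx arrays begin after the header):

```python
import numpy as np

def write(f, sizes, docs, size_offset, doc_offset, sizes_start, docs_start):
    # Size values from each rank are concatenated in rank order and stored as int32.
    f.seek(sizes_start + size_offset * np.int32().itemsize)
    f.write(np.asarray(sizes, dtype=np.int32).tobytes(order='C'))
    # doc_idx values are stored as int64 starting at docs_start.
    f.seek(docs_start + doc_offset * np.int64().itemsize)
    f.write(np.asarray(docs, dtype=np.int64).tobytes(order='C'))
```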
Co-authored-by: Thomas Wang <[email protected]>
I'll check on that. I think a
Right, I suspect performance should be similar, especially because there is not a lot of communication in these implementations. In my case, the MPI version is handy because it's easier and less error prone to launch MPI jobs on my system. I also know that interface a bit better and it offers a wider set of operations, so it's quicker to develop and test. I then circle back and add the torch.distributed equivalents.
Thanks for taking a look already, @thomasw21. Enjoy your holidays!
The functions I have in there right now are somewhat specific to the needs of the preprocessing script. It should be straightforward to extend this to provide a new "merge files" function that takes a list of part files and an object defining the distributed environment. Because the final merge file has to be created specially when multiple ranks will be writing to it, some of this would likely need to go into the class constructor. I'm thinking we could add that function as a method to the dataset class, so that the main merge logic changes from a per-file loop into a single call.
Then we drop the for loop and just pass the full list of files to the merge function, and the ranks in the distributed context divide up the work.
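For concreteness, the change being described might look roughly like this (a sketch; merge_files_dist and its exact arguments come from this PR and may differ, while make_builder/merge_file_ follow the existing Megatron builder API):

```python
from megatron.data import indexed_dataset
# assumed inputs: part_files (list of per-rank file prefixes), vocab_size,
# output_bin_file/output_idx_file/output_prefix, and distctx (a DistData instance)

# before: rank 0 appends each part file sequentially through the builder
builder = indexed_dataset.make_builder(output_bin_file, impl='mmap', vocab_size=vocab_size)
for part in part_files:
    builder.merge_file_(part)
builder.finalize(output_idx_file)

# after: drop the loop and pass the whole list to a collective merge, where the
# ranks in the distributed context split up the writing of the final file
merge_files_dist(output_prefix, part_files, distctx,
                 dtype=indexed_dataset.best_fitting_dtype(vocab_size))
```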
Finally finished the review. I know it's still WIP, but since I hadn't looked at the entire PR before, I re-reviewed it. This looks incredible, in particular the performance gain we seem to obtain from this:
- I'd advocate for the removal of mpi4py.
- Please factorize some code; I'm sure there's a good abstraction allowing re-use of some of the existing code.
- If the multiprocessing implementation becomes tricky, feel free to put it away for now. We can always come back to it.
- Can you share the scripts you're running to compare them? I'll try to run them this week in order to obtain a benchmark on our setup.
megatron/data/distdata.py
Outdated
    from mpi4py import MPI
    self.MPI = MPI
except:
    #print(f"ERROR: mpi4py requested, but failed to import, falling back to torch.distributed.", flush=True)
I'd be in favor of raising an exception and stopping the script, but really this is a personal opinion.
megatron/data/distdata.py
Outdated
class DistData(object):
    def __init__(self, backend='gloo', use_mpi4py=False):
        self.MPI = None
This is confusing, as torch.distributed can use MPI as its backend.
I renamed that variable to be clear
import torch
import torch.distributed as dist

class DistData(object):
So that's an awesome abstraction for multinode, especially in the case where we might want to support both torch.distributed and mpi4py. Though as I'm thinking about it, supporting mpi4py will become costly very soon and might not bring much to the table, as we might need to implement all improvements in two frameworks that can essentially use the same mpi backend. As you mentioned in a comment, we should be able to do everything using torch.distributed, so let's remove mpi4py and come back to it the day torch.distributed isn't enough.
Sorry for my mistake; I thought it would be interesting to support both cases, but I feel it ends up being a burden here, where we need to create a higher-level abstraction for not much.
Yeah, it would require some maintenance. I can yank out the MPI code just before we merge the PR.
megatron/data/distdata.py
Outdated
import torch.distributed as dist

class DistData(object):
    def __init__(self, backend='gloo', use_mpi4py=False):
Do you know what the difference is in terms of performance between gloo and mpi?
I haven't measured this. There is not a ton of communication going on in this script, so it likely doesn't make much difference, especially since this will be bottlenecked by I/O performance.
Under the "Which backend to use?" section, the docs generally recommend that people stick with gloo unless they really need mpi:
https://pytorch.org/docs/stable/distributed.html
Ah I did not know we had to build pytorch from source in order to get mpi working ...
tools/preprocess_dataset_mpi.py
Outdated
args.dist_merge = True
if args.dist_merge:
That's odd ^^
That was a placeholder for a new --merge option. Just added that.
tools/preprocess_dataset_mpi.py
Outdated
for key in args.columns:
    filemain = get_filename(args, key)
    filerank = get_filename(args, key, args.rank)
    gather_files_dist(filemain, [filerank], args.distctx, dtype=best_fitting_dtype(args.vocab_size))
Use merge_files_dist in order to have all the checks.
Good eye that I'm not using merge_files_dist here. It's used by the merge_preprocessed_data.py script, which I've updated to use the distributed merge. I can add that in a future PR. If it helps, I can move that function to the other PR.
There is a subtle difference between merge_files_dist and gather_files_dist that I haven't explained well in the comments.
In merge_files_dist, all ranks must provide the full list of files to be merged, and it is assumed that any rank can read any of those files.
In the gather case, each process provides the subset of files that it will contribute, and it may be that only the calling process can access that subset. The gather call allows one to optionally store the per-rank file on node-local storage like /dev/shm or a node-local SSD, which might be accessible only to the calling process.
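To illustrate the distinction with the identifiers used in this script (the numranks attribute and the exact merge_files_dist signature are assumptions, not verified against the PR):

```python
# merge_files_dist: every rank passes the SAME full list of part files, and any
# rank is assumed to be able to read any of them (shared file system).
filemain = get_filename(args, key)
allfiles = [get_filename(args, key, r) for r in range(args.distctx.numranks)]
merge_files_dist(filemain, allfiles, args.distctx,
                 dtype=best_fitting_dtype(args.vocab_size))

# gather_files_dist: each rank passes only the files IT contributes, which may
# sit on node-local storage (e.g. /dev/shm) that other ranks cannot read.
filerank = get_filename(args, key, args.rank)
gather_files_dist(filemain, [filerank], args.distctx,
                  dtype=best_fitting_dtype(args.vocab_size))
```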
tools/preprocess_dataset_mpi.py
Outdated
# rename files for now and also do regular merge so we can time both and "cmp" them
if args.rank == 0:
    binfile = data_file_path(filemain)
    idxfile = index_file_path(filemain)
    os.rename(binfile, binfile + ".par")
    os.rename(idxfile, idxfile + ".par")
Okay maybe you can add a TODO so I won't miss it before merge.
Actually, this is useful now with the --merge=both option, so I've updated it for that.
tools/preprocess_dataset_mpi.py
Outdated
    os.rename(binfile, binfile + ".par")
    os.rename(idxfile, idxfile + ".par")

args.distctx.barrier()
all_sum already acts as a barrier, no?
Yeah, that's true. I'll update the comment to note that it's also acting as a barrier (in case someone later decides to drop the timing/report bit).
tools/preprocess_dataset_mpi.py
Outdated
startup_end = time.time()
if args.rank == 0:
    print("Seconds to startup:", startup_end - startup_start)
    print(f"Seconds to startup: {startup_end - startup_start}")
🥳
Thanks for the re-review, @thomasw21. I'll try to get back to make these changes in the next day or so. In the meantime, I think it should work for you if you want to try it on multiple nodes. I did also see that JZ has GPFS for several of its directories, so I think it should be fine. At least it seems that HOME, WORK, and SCRATCH are GPFS: http://www.idris.fr/eng/jean-zay/cpu/jean-zay-cpu-calculateurs-disques-eng.html
This PR currently writes out two sets of files. It runs the parallel merge, renames those files with a .par suffix, and then runs the serial merge.
The cost for each merge is measured and reported separately, so you can get the timing for each from one run. I then just use cmp to check that the parallel and serial output files match.
This is all for testing. Longer term, we could perhaps just settle on the parallel merge. However, there might be a use case to still leave the serial merge in there. For example, that might be a decent workaround for someone who does not have a file system like Lustre/GPFS. If so, we could keep an option to select between them.
Oh, and to run, I've been doing something like:
I know you'd like to drop the mpi4py path, but just in case you want to try it anyway, something like the following should work:
The main difference here is the addition of the mpi4py option; you can toggle between the torch.distributed and mpi4py code paths with it.
@thomasw21, heads up that I just pushed a commit that changes the usage to add a new option. And I merged in main to pick up the cached dataset changes.
Co-authored-by: Thomas Wang <[email protected]>
Let me know when it's up for review; I'll try spending some time on it later in the week.
Quick update: I'm currently trying to run your multi node version, and I seem to have some issues.
Code
In the launch script:
export MASTER_ADDR=$SLURM_SUBMIT_HOST # "$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n1)"
export MASTER_PORT=12345
export DISTRIBUTED_ARGS=" \
--nproc_per_node 40 \
--nnodes $SLURM_JOB_NUM_NODES \
--node_rank $SLURM_NODEID \
--master_addr $MASTER_ADDR \
--master_port $MASTER_PORT \
"
PREPROCESSED_DATASET_PREFIX=$SCRATCH/data/processed/openwebtext-gpt2
export PROCESSING_ARGS=" \
--input stas/openwebtext-10k \
--count 10000 \
--output-prefix $PREPROCESSED_DATASET_PREFIX \
--merge both \
--dataset-impl mmap \
--tokenizer-type PretrainedFromHF \
--tokenizer-name-or-path gpt2 \
--seed 101
"
time python -m torch.distributed.launch $DISTRIBUTED_ARGS Megatron-DeepSpeed-Adam/tools/preprocess_data_dist.py $PROCESSING_ARGS
and then call this via a SLURM job.
Issues
I'm getting the following when running on 4/8 nodes (it works correctly for 1/2 nodes):
ValueError: ProcessGroupGloo::scatter: invalid tensor size at index 80 (expected (32), got (31))
I'm not sure why this happens. My small guess is that, for some reason, a process is awaiting something of size 32 and got 31. This sounds a lot like the remainder of the indices: the reason things would work for 1/2 nodes is that 10000 documents split evenly across 40 or 80 ranks, but not across 160 or 320 ranks. Maybe I'm missing something? I'll look more in depth tomorrow. I might have gotten the script very wrong, as I'm not used to using torch.distributed.launch.
Oh, shoot. Yes, I can reproduce that. Apparently, I hadn't tested that configuration. I guess it's a true scatter rather than a scatterv in torch.distributed. I'll work on a fix.
@thomasw21, I just pushed a commit that seems to fix it for me. Would you please try again when you get a chance?
# Receive a tensor of the max count size from the root,
# then copy values into output numpy array, which may be smaller.
recvtensor = torch.zeros(max(counts), dtype=torch.int64)
dist.scatter(recvtensor, scatterlist, src=root)
What about scatter_object_list? I tried it and it seems to work nicely, though I don't know what the cost of that method is vs padding.
Oh, interesting. So I seem to be using an old enough torch.distributed that I don't have scatter_object_list:
AttributeError: module 'torch.distributed' has no attribute 'scatter_object_list'
It looks like it was added about 9 months ago in this commit. I can also see that it internally is just calling broadcast and scatter a couple of times.
After seeing that, I think our pad method is probably the best way to go after all.
yeah let's go with padding.
megatron/data/distdata.py
Outdated
for num, s in zip(counts, slices):
    padtensor = torch.zeros(max(counts), dtype=torch.int64)
    padtensor[:num] = s
    scatterlist.append(padtensor)
If scatter_object_list doesn't work for you, maybe we can improve slightly:

import torch.nn.functional as F
# split the flat array into one slice per rank, then right-pad each slice with
# zeros so all chunks have the equal length required by dist.scatter
slices = torch.split(torch.from_numpy(invals), counts)
max_size = max(counts)
scatterlist = [F.pad(slice, (0, max_size - len(slice))) for slice in slices]
Yep, that's cleaner. Thanks for the tip.
I did that just to try it and pushed a commit in case you want to use it. I can also try scatter_object_list; I suspect that should also work.
I suppose the tensor-based method could be more efficient communication-wise (no pickle step), though this scatter step will not take much time in either case compared to the total time of the script.
The bigger concern might be the memory required on rank 0 to put together the arguments for the scatter. With the mpi4py Scatterv, I know mpi4py sends data from the original numpy array. With torch, it looks like we'll at least be doubling the memory by effectively slicing up the original numpy array into these per-rank tensors/sublists. I don't know if it would be worse than doubling the memory -- it depends on the implementation under the covers. However, even that likely won't be an issue until the input index list is really big, at which point we can always fall back to the file-based scatter.
Whichever you prefer is good with me.
I think the day we can't fit everything on a single process, we'll think of a better way IMO (perhaps bring back mpi4py). Let's stick to the padding strategy.
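Putting the pieces of this thread together, a consolidated sketch of the padding strategy (assumes an initialized gloo process group; the function is written standalone here rather than as a DistData method, and its name is illustrative):

```python
import numpy as np
import torch
import torch.distributed as dist
import torch.nn.functional as F

def scatterv_int64(invals, counts, root=0):
    """Scatter variable-length chunks of an int64 numpy array from root."""
    rank = dist.get_rank()
    max_size = max(counts)
    scatterlist = None
    if rank == root:
        # Split the flat array into per-rank slices and right-pad each one to a
        # common length, since dist.scatter requires equal-sized tensors.
        slices = torch.split(torch.from_numpy(invals), counts)
        scatterlist = [F.pad(s, (0, max_size - len(s))) for s in slices]
    # Receive a tensor of the max count size, then trim off the padding
    # and return only this rank's real values.
    recvtensor = torch.zeros(max_size, dtype=torch.int64)
    dist.scatter(recvtensor, scatterlist, src=root)
    return recvtensor[:counts[rank]]
```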
megatron/data/distdata.py
Outdated
# then copy values into output numpy array, which may be smaller.
recvtensor = torch.zeros(max(counts), dtype=torch.int64)
dist.scatter(recvtensor, scatterlist, src=root)
outval[:] = recvtensor[:counts[self.rank]]
If we don't benefit from the in-place operator, let's return the tensor instead of doing an in-place operation? Typically you could return only recvtensor[:counts[self.rank]], which would remove the need to initialize outval.
Good suggestion. Made that change.
Overall that looks good to me. However, when running the comparison:
- The idx file sizes seem to differ, probably an offset that's wrong for some reason:
> cmp -b openwebtext-gpt2_text_document.idx openwebtext-gpt2_text_document.idx.par
openwebtext-gpt2_text_document.idx openwebtext-gpt2_text_document.idx.par differ: byte 30115, line 118 is 12 ^J 0 ^@
You seem to have already integrated your fix in this branch, so I suspect something about a wrong offset.
- I see no performance improvement using 4 nodes; in fact I see a significant loss in performance.
config | nodes | cpus_per_node | serial | parallel |
---|---|---|---|---|
stas/openwebtext-10k | 1 | 40 | 0.06870174407958984s (314.938 MB/s) | 0.18674302101135254s (115.864 MB/s) |
stas/openwebtext-10k | 4 | 40 | 0.5817015171051025s (37.204 MB/s) | 0.6286804676055908s (34.424 MB/s) |
I see potentially two issues:
- stas/openwebtext-10k might just be too small. Going to try on openwebtext.
- That might be linked to our setup, where we're not using SSDs, though that wouldn't really explain the massive slowdown when increasing the number of nodes. Will investigate.
def scatterv_(self, invals, counts, outval, root=0):
    """Scatter int64 values from invals according to counts array, receive values in outval"""

    self.allassert(len(counts) == self.numranks,
Let's put that in a future PR. For me, this PR introduces the feature; we can always improve on it later.
@thomasw21, thanks for running those tests.
Regarding the index difference: how many items are you selecting from openwebtext? I'm trying to get a handle on what section of the file that might be. Are the output files being written to one of the GPFS directories?
I'm a bit stumped on the idx difference. In that case, do you know how many total ranks you were using?
Sorry for the short reply earlier; I was away from my computer. I re-ran it, and it seems to be fine... so I don't know what was wrong. Let's ignore it for now, as it was likely an error on my end. However, I am seeing a drastic slowdown currently (I re-ran the same test).
FYI, I'll go on holidays starting Friday, so I might merge this seeing as I'm confident it works (just not confident it runs any faster on our setup). I'll benchmark everything upon coming back, or let @stas00 try to run the scripts.
Ok, good. I was focused on finding the cause of the correctness problem first, but it sounds like that might have been a false alarm. Please let me know if that shows up again. Regarding performance, yes, the numbers you are seeing are quite bad and scaling poorly. If I try the same configuration with
On file systems that allow parallel writes to a file, the cost can be high if processes write to sections that are too close to each other. The file system servers often implement some form of byte-range locking to ensure that only one client is writing to a given region at a time, and I've seen cases where it's easy to thrash those lock servers, especially in false-sharing scenarios. Since this dataset is relatively small, the processes will be slicing up the file pretty finely. How do things fare for a larger dataset? As a next step up, maybe try a larger input.
Also, heads up that things can seem to get stuck after this message for a while:
Rank 0 prints this message when it finishes its portion of the work, and then it waits for everyone in a barrier. Rank 0 can finish much earlier than the full set of procs due to load imbalance across the processes. With that barrier, everyone has to wait on the slowest process, and that's probably the one that ended up with the most text to process. I have seen a factor-of-2 imbalance in much of my testing. Things are better if the data is shuffled. I have some ideas to improve load balance in future work if that becomes an issue.
Can't seem to benchmark on openwebtext for now.
Okay, let's merge this. It's safe to say that it works and doesn't break other scripts. This will allow us to look at the jsonl version of that script. A few things I think we might want to do:
I'll create issues when I come back, but if you're interested in any of them or want to challenge any ideas, feel free to ping me! And also... AWESOME WORK!!! Thanks again for the massive contribution! If you have any interest in bigscience, feel free to join the slack. cc @stas00
The test suite can't handle this addition. Why was offline mode enabled? How can it get the required files then? And the dataset name is incorrect; it should be "--input stas/openwebtext-10k".
Thanks for all of your help getting this into shape, @thomasw21! Enjoy your holidays!
This can speed up the merge step, but it requires that the user is writing the final dataset to a POSIX-compliant parallel file system, like Lustre or GPFS. Each rank identifies the file offsets for its own data using collective operations, fseeks to those sections, and writes its data.
This adds a --merge option to preprocess_data_dist.py, which can be set to any of {parallel, serial, both}. It defaults to parallel, but one can fall back to the algorithm where rank 0 merges all files sequentially with --merge serial. A serial merge might be helpful to people for whom the parallel merge does not work due to the lack of a POSIX-compliant parallel file system. The both option is useful for testing purposes: it merges rank files with both the parallel and serial methods so that the resulting files can be compared with something like cmp.
An optional --scratch option can be used to store intermediate per-rank files in storage local to the compute node, like /dev/shm, which avoids creating those files on the shared file system and offers faster write/read performance.
TODO:
- torch.distributed
- verify that <B is encoded as a single byte as expected
Scaling tests:
In running tests to check encoding rates at different node counts, I also get the merge time at the end. The script actually does both a parallel merge and a serial merge so that I can compare their contents. That also provides an easy way to gather times for both. The parallel merge can optionally write the per-rank files to a scratch directory with --scratch, like /dev/shm, which removes load from the parallel file system.
Each rank writes its own file, and I'm running 40 ranks per node. Times here are in seconds. Test results can vary based on how busy the (shared) file system is at the time; I've only taken one sample here.
The final merged file is the same size in all cases (529GB), but as the number of ranks increases, the script generates more per-rank files with each one being smaller. The total data being processed is the same, but the file counts can vary. Ideally, if things are bandwidth bound, you'd expect a constant time across each row.
Anyway, the main takeaway is that there is a nice boost using the parallel merge.
The scratch times aren't showing much improvement over writing the per-rank files to the parallel file system. My guess is that the OS has cached the per-rank file in page cache, so it's reading back from memory even when the per-rank file is written to the parallel file system. There might still be some impact on the cost to create and delete those files, but I'm not recording that.