Saving all trained params in a single file #7722
Comments
I think we can even handle the "overwrite" file case. We can also have a counter for the |
Let me take an example to explain the saving format as I understand. If there are two parameters,
Am I right? I think we may need another string to record the parameter's name.
In the first implementation, we may fill the parameter's Tensor in the |
So I am thinking of it more like this:
(Otherwise, we won't know the size of the string fc0.w0/fc0.b0 beforehand, so we need to merge it with the serialization of the LoDtensor, and then generate the size) |
@Xreki: I think the name isn't required; it is obtained from the programDesc and passed accordingly: code. And since we are storing the programDesc as a protobuf, I don't see the need to store it again (provided the ordering when iterating through
For a concrete example, if we have fc1, b1, fc2, b2. So the order will be the same when we call
We iterate over this list, and pass an additional counter for save and load:
While saving:
.
Now while loading we proceed as:
Now we read the uint64 to get the bytes for fc1 and read those many bytes, then deserialize to obtain fc1.
Now we read the uint64 to get the bytes for b1 and read those many bytes, then deserialize to obtain b1. .
Now we read the uint64 to get the bytes for b2 and read those many bytes, then deserialize to obtain b2. |
OK. The design is based on the order. We need to make sure the loading order is totally the same as the saving order. |
Merging all params in a single file
For inference, we will have 2 files: one for the programDesc and one that holds all the params together. We look at one approach to do this.
Understanding save/load ops (C++ side)
To understand the current serialization, we look at save_op.
In save_op the main work is performed by SerializeToStream(<ofstream>, <framework::LoDTensor>, ..) (Code). This function saves a version number, the size of the LoD, and the actual LoD data.
Then it calls SerializeToStream(<ofstream>, <Tensor>, ..) (Code). This function saves a version number, the tensor description as a serialized protobuf, and the actual data.
The corresponding load_op does the deserialization accordingly (respecting the ordering in save_op).
Understanding how a model is saved (python api)
Now we look at how save/load works for saving the actual model params, using the implementation of save_vars in fluid (Code). We see that a new program is created and a save op is appended for each var that is persistable. Then the executor runs this program.
Approach
We basically make two assumptions:
overwrite option which is in save_op.
While saving:
We basically store a uint64_t number in addition to the actual serialized bytes as in the original save. This number tells us the size of the serialized LoDTensor in bytes.
When save is called for the first time, we create a file and create a string that holds the serialized LoDTensor data. We first store the size of this string as a fixed-width (uint64_t) number, and then store the string.
When save is called later, we go to the end of the file and store 2 things: the size of the string and the string itself.
While loading:
We pass an additional attribute in order to load the correct chunk of parameters: a counter value that gives the relative order of the param, counting from 0.
With this counter and the extra size information that we stored, we can hop to the appropriate part of the file, and read the chunk, and deserialize it.
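The saving and loading steps above can be sketched in Python as follows. This only illustrates the proposed file layout and is not the actual op: the helper names append_chunk and load_chunk are hypothetical, the payload stands in for the bytes that SerializeToStream would produce for one LoDTensor, and the little-endian byte order chosen for the uint64_t size is an assumption.

```python
import os
import struct

# Fixed-width size prefix: an unsigned 64-bit integer.
# Little-endian is an assumption made for this sketch.
SIZE_FMT = "<Q"
SIZE_BYTES = struct.calcsize(SIZE_FMT)  # 8 bytes


def append_chunk(path, payload):
    """Append one size-prefixed chunk to the file (hypothetical helper).

    Opening in append mode creates the file on the first call and
    writes at the end of the file on every later call.
    """
    with open(path, "ab") as f:
        f.write(struct.pack(SIZE_FMT, len(payload)))
        f.write(payload)


def _read_size(f):
    """Read the next uint64 size prefix, or fail if the file has ended."""
    header = f.read(SIZE_BYTES)
    if len(header) != SIZE_BYTES:
        raise IndexError("counter is past the last chunk")
    (size,) = struct.unpack(SIZE_FMT, header)
    return size


def load_chunk(path, counter):
    """Return the payload of the chunk at 0-based position `counter`."""
    with open(path, "rb") as f:
        # Hop over the chunks that come before the one we want.
        for _ in range(counter):
            f.seek(_read_size(f), os.SEEK_CUR)
        size = _read_size(f)
        payload = f.read(size)
        if len(payload) != size:
            raise ValueError("file is truncated")
        return payload
```

For params saved in the order fc1, b1, fc2, b2, calling load_chunk(path, 2) would skip two chunks and return the bytes for fc2. The counter is only meaningful if loading uses the same order as saving. Because the sketch always appends, an existing file would have to be removed before saving a new set of params.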
For implementation, I think it will be better to have another op for this rather than replacing the original save_op/load_op, so that it is easier to debug. I also don't know the details of how load_op and save_op are used in the distributed version as of now.