Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Computational graph grows with the sequence length #467

Open
PABannier opened this issue Aug 20, 2023 · 5 comments
Open

Computational graph grows with the sequence length #467

PABannier opened this issue Aug 20, 2023 · 5 comments
Labels
question Further information is requested

Comments

@PABannier
Copy link
Contributor

Hello everyone!

I'm currently enhancing the GGML implementation of a LSTM network.
My main focus is to avoid having scalability issues with the computational graph.

Currently I'm setting GGML_MAX_NODES to a very high value (100k): https://github.com/PABannier/bark.cpp/blob/main/ggml.h#L206C32-L206C38

This is due to the fact that for the LSTM network the number of nodes in the computational graph grows with the sequence length: https://github.com/PABannier/bark.cpp/blob/main/encodec.cpp#L81C25-L81C36 .

I wanted a quick fix in order to have a first POC of bark.cpp. Now that we want to clean things, I'm wondering what's the best solution?

I was thinking if we should not create a ggml context and computational graph per time point to avoid these scalability issues in the graph. One subgraph would be created per time point, run the forward pass and obtain the result of one cell.
It feels hacky and quite costly considering the overhead of building the graph, copying tensors from one context to another, etc.

What do you think would be the best solution?

@lin72h
Copy link

lin72h commented Aug 21, 2023

great works, have you considered rwkv.cpp which also a form of RNN, from my limited knowledge I think it's superior than LSTM.

@ggerganov
Copy link
Owner

Maybe we should refactor ggml_cgraph to allow to dynamically allocate nodes. Something like this:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        struct ggml_tensor * nodes_stack[GGML_MAX_NODES];
        struct ggml_tensor * grads_stack[GGML_MAX_NODES];
        struct ggml_tensor * leafs_stack[GGML_MAX_NODES];

        // by default we work on the stack
        struct ggml_tensor ** nodes = nodes_stack;
        struct ggml_tensor ** grads = grads_stack;
        struct ggml_tensor ** leafs = leafs_stack;

        void * visited_hash_table[GGML_GRAPH_HASHTABLE_SIZE];

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;
    };

When we allocate the graph on the stack, initialize nodes, grads and leafs to point to the stack arrays.
When we allocate it on the heap, we allow to pass arbitrary number of nodes which we malloc and redirect nodes, grads and leafs to them.

The reason I am thinking about something like this is because I want to keep the option have the graph fully allocated on the stack when we can fit it in GGML_MAX_NODES.

Any other suggestions? cc @slaren

@ggerganov ggerganov added the question Further information is requested label Aug 22, 2023
@slaren
Copy link
Collaborator

slaren commented Aug 22, 2023

This is probably the best we can do for now. An improvement would be to move the storage to the end of the struct, so that the space reserved for the stack can still be used when allocated on the heap. Ie:

    struct ggml_cgraph {
        int n_nodes;
        int n_leafs;

        // by default we work on the stack
        struct ggml_tensor ** nodes = storage + 0;
        struct ggml_tensor ** grads = storage + GGML_MAX_NODES;
        struct ggml_tensor ** leafs = storage + GGML_MAX_NODES * 2;

        void ** visited_hash_table = storage + GGML_MAX_NODES * 3;

        // performance
        int     perf_runs;
        int64_t perf_cycles;
        int64_t perf_time_us;

        // may be larger than this when allocated on the heap
        struct ggml_tensor * storage[GGML_MAX_NODES * 3 + GGML_GRAPH_HASHTABLE_SIZE];
    };

The hash table must also grow to hold all the nodes and leafs, and the size should be a prime number, otherwise the number of collisions will be huge and the performance will be affected.

In the long term, I hope we will be able to remove the node limits from the computation graphs and calculate the required memory automatically, but that's probably going to require a larger redesign.

@PABannier
Copy link
Contributor Author

Looking at the rwkv repo, I think the easiest solution for now is to move the computational graph of Encodec on the heap then. I have skimmed through the code and it seems like they are building a computational graph per time point.

@datduonguva
Copy link

You can take a look on my work.
Basically, the computation graph (a GRU in this case) includes only the forward pass of 1 cell. To loop through the entire sequence, at each timestep, we reset the input and call the ggml_graph_compute_with_ctx()

https://github.com/datduonguva/ggml-experiments/blob/master/rnn_text_gen/rnn_text_generation.cpp

Hope this helps! (I am very new to the project so my approach might not be correct).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants