This is a sample of the traces from multiple LLM inference services in Azure, collected on November 11, 2023. It is the dataset described and analyzed in the ISCA 2024 paper 'Splitwise: Efficient generative LLM inference using phase splitting'.
The dataset comprises this description and a Jupyter Notebook with the plots in the ISCA paper.
The data is made available and licensed under a Creative Commons Attribution (CC BY) license. By downloading or using the data, you agree to the terms of this license.
If you use this data for a publication or project, please cite the accompanying paper:
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini. "Splitwise: Efficient generative LLM inference using phase splitting", in Proceedings of the International Symposium on Computer Architecture (ISCA 2024). ACM, Buenos Aires, Argentina, 2024.
Lastly, if you have any questions, comments, or concerns, or if you would like to share tools for working with the traces, please contact us at [email protected]
You can download the datasets here:
| Field | Description |
|---|---|
| TIMESTAMP | Invocation time |
| ContextTokens | Number of context tokens |
| GeneratedTokens | Number of generated tokens |
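As a quick way to inspect these fields, the trace can be loaded with pandas. This is only a minimal sketch; the file name `AzureLLMInferenceTrace_conv.csv` is an assumption, so substitute the name of the trace file you actually downloaded.

```python
import pandas as pd

# Assumed file name; replace with the trace file you downloaded.
trace = pd.read_csv("AzureLLMInferenceTrace_conv.csv", parse_dates=["TIMESTAMP"])

# Each row is one request: arrival time plus context and generation sizes.
print(trace.head())
print(trace[["ContextTokens", "GeneratedTokens"]].describe())
```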
Due to customer privacy requirements (e.g., GDPR), we do not have visibility into the content of the prompts. We instead use the production traces to guide the input and output sizes: we send an input prompt with the required number of tokens and force the model to generate the corresponding number of output tokens for each request. Note that the text of the input prompts does not impact the performance metrics that we benchmark, since they depend only on the input and output sizes.
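As an illustration of this replay approach, the sketch below sends synthetic prompts sized from the trace to an OpenAI-style completions endpoint and caps generation at the traced output length. The endpoint URL, model name, prompt construction, and the `ignore_eos` option (which some servers, such as vLLM, expose to prevent early stopping) are all assumptions, not part of the released benchmark harness.

```python
import pandas as pd
import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder serving endpoint
MODEL = "my-model"                                  # placeholder model name

trace = pd.read_csv("AzureLLMInferenceTrace_conv.csv", parse_dates=["TIMESTAMP"])

for row in trace.head(100).itertuples():
    # Build a synthetic prompt of roughly ContextTokens tokens; real tokenization
    # is model-specific, so repeating a single word is only an approximation.
    prompt = "hello " * int(row.ContextTokens)

    # Force the generation length to match the trace.
    requests.post(ENDPOINT, json={
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": int(row.GeneratedTokens),
        "ignore_eos": True,  # assumption: the server supports this option
    })
```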
This is the sample data used in the ISCA paper mentioned above. To validate the release, we reproduce the characterization graphs from the paper using the released trace in this Jupyter Notebook.
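The released notebook contains the full reproduction; as a rough illustration of the kind of characterization it produces, the snippet below plots the CDF of context and generated token counts per request (the file name is again an assumption).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

trace = pd.read_csv("AzureLLMInferenceTrace_conv.csv", parse_dates=["TIMESTAMP"])

fig, ax = plt.subplots()
for col in ["ContextTokens", "GeneratedTokens"]:
    values = np.sort(trace[col].to_numpy())
    cdf = np.arange(1, len(values) + 1) / len(values)
    ax.plot(values, cdf, label=col)
ax.set_xscale("log")
ax.set_xlabel("Tokens per request")
ax.set_ylabel("CDF")
ax.legend()
plt.show()
```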