This is the official open-source repository for the paper LLM-SAP: Large Language Model Situational Awareness Based Planning.
The paper is available at arXiv.
- LLM-SAP: Large Language Model Situational Awareness Based Planning
- Dataset Introduction
- Quick Start
- Citation
This dataset covers hazardous events drawn from 24 home scenarios.
It includes a list of scenes, scenario events, planning complexity, detailed scenario descriptions, corresponding image descriptions, the best human-written finite state machine demonstrations, and approximate visualization images of the finite state machines.
The detailed scene content can be viewed and used via the CSV or JSON files under dataset.
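As a rough sketch of how a single scenario record could be consumed, assuming illustrative field names such as scene, scenario_event, and planning_complexity (the actual CSV/JSON headers in the dataset folder may differ):

```python
import json

# Hypothetical example record; the real field names and values in
# dataset/ may differ -- check the CSV/JSON files for the exact schema.
record_json = """
{
  "scene": "kitchen",
  "scenario_event": "unattended stove left on",
  "planning_complexity": "medium",
  "scenario_description": "Smoke rises from a pan on a lit stove with no one nearby.",
  "fsm_demo": "S0 -> detect_smoke -> S1 -> turn_off_stove -> S2"
}
"""

record = json.loads(record_json)
print(record["scene"])                 # kitchen
print(record["planning_complexity"])   # medium
```

A CSV row from the same dataset could be read analogously with `csv.DictReader`.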
In addition, we provide around 600 high-quality multimodal situational awareness planning samples generated with the SAP prompting method described in this paper, which can be used for fine-tuning; they are available on GitHub or Hugging Face.
The dataset folder contains the main dataset content, including the test set 24_Home_Hazard_Scenario and the training set Household_Safety.
The finite state machines generated by the tests described in the paper, together with the corresponding evaluation results, are stored in the generated_FSM and eval_result folders.
Both folders are explained in detail below.
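To make the FSM outputs concrete, here is a minimal sketch of a hazard-response finite state machine as a transition table; the states, events, and actions are illustrative only, not taken from the dataset:

```python
# A minimal finite-state-machine sketch for a hazard-response plan.
# States and events are illustrative, not from the repository's FSM demos.
fsm = {
    "Idle":      {"smoke_detected": "Assess"},
    "Assess":    {"stove_on": "Intervene", "false_alarm": "Idle"},
    "Intervene": {"stove_off": "Verify"},
    "Verify":    {"smoke_cleared": "Idle"},
}

def run(fsm, start, events):
    """Follow a sequence of events through the FSM; return visited states."""
    state, trace = start, [start]
    for ev in events:
        state = fsm[state][ev]
        trace.append(state)
    return trace

print(run(fsm, "Idle", ["smoke_detected", "stove_on", "stove_off", "smoke_cleared"]))
# ['Idle', 'Assess', 'Intervene', 'Verify', 'Idle']
```

The FSMs in generated_FSM are produced as text by the models and compared against the human-written demonstrations; this dictionary form is just one convenient in-memory representation.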
- baseline prompting template
- SAP prompting template
- Comparison eval prompting template
- Single round eval without feedback prompting template
- Second round eval with feedback prompting template
- Feedback prompting template
- Ablation study 1 normal prompting template
- Ablation study 1 SAP prompting template
- Ablation study 2 Zero_shot_COT prompting template
- Ablation study 2 EP05 prompting template
- Ablation study 2 EP09 prompting template
GPT-4 generated results
- GPT-4+SAP
- GPT-4
GPT-3.5 generated results
- GPT-3.5+SAP
- GPT-3.5 (the demo is shown here)
Claude-2 generated results
- Claude2+SAP
- Claude2
- GPT-3.5+SAP: GPT-3.5+SAP was selected as the test baseline.
- Claude-2 eval: the GPT-3.5+SAP output is evaluated against the best demo for the corresponding scene, and feedback is generated.
- Regenerate with feedback: new FSM results are generated by GPT-3.5 according to the feedback.
- Claude-2 eval new FSM: the GPT-3.5+SAP+feedback output is evaluated with GPT-3.5 based on the best demo, or with GPT-4+SAP based on the GPT-3.5+SAP result.
(The whole demo is shown here.)
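The two-round evaluate-with-feedback loop above can be sketched as follows. The functions generate_fsm and evaluate_against_demo are stand-ins for the actual model calls (GPT-3.5+SAP generation, Claude-2 evaluation) and are not part of this repository's API:

```python
# Sketch of the two-round feedback loop; placeholder functions only.

def generate_fsm(scenario, feedback=None):
    # Placeholder: a real implementation would prompt GPT-3.5 with the
    # SAP template, plus the evaluator's feedback in the second round.
    return f"FSM({scenario}, feedback={'yes' if feedback else 'no'})"

def evaluate_against_demo(fsm_text, best_demo):
    # Placeholder: a real implementation would prompt Claude-2 to compare
    # the generated FSM with the best human-written demonstration.
    return {"score": 0.7, "feedback": f"compare {fsm_text} with {best_demo}"}

def sap_feedback_loop(scenario, best_demo):
    first = generate_fsm(scenario)                        # round 1: baseline
    review = evaluate_against_demo(first, best_demo)      # eval + feedback
    second = generate_fsm(scenario, review["feedback"])   # round 2: regenerate
    final = evaluate_against_demo(second, best_demo)      # eval new FSM
    return second, final

fsm_text, result = sap_feedback_loop("kitchen fire", "demo_fsm")
print(fsm_text)  # FSM(kitchen fire, feedback=yes)
```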
The format prompt is shown in the prompt template.
The ablation tests were generated only with GPT-4.
- GPT-4 format with SAP: generation result here
- GPT-4 format without SAP: generation result here
- Vicuna13b format with SAP: only the Vicuna generation was obtained; it has not been evaluated. Generation result here
- Vicuna13b format without SAP: only the Vicuna generation was obtained; it has not been evaluated. Generation result here
The results of ablation study 1, evaluated only by GPT-4.
The prompt of Zero_shot_COT is: Let's think step by step.
Generation result here
The prompt of EP05 is: Are you sure that’s your final answer? It might be worth taking another look.
Generation result here
The prompt of EP09 is: Stay focused and dedicated to your goals. Your consistent efforts will lead to outstanding achievements.
Generation result here
The generated FSMs are evaluated by GPT-4 and Claude-2. Please find the corresponding results in the folders.
The initial GPT-3.5+SAP result is used in the first loop here.
The second-loop GPT-3.5+SAP+feedback results are shown here and here.
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is GPT-3.5 with SAP, x_4 is GPT-3.5 without SAP, x_5 is Claude-2 with SAP, x_6 is Claude-2 without SAP.
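The x_i labels in the six-model result files can be decoded mechanically. The '>'-separated ranking format below is an assumption for illustration; check the actual txt files for the exact layout:

```python
# Label-to-model mapping for the six-model ranking files, as stated above.
LABELS = {
    "x_1": "GPT-4 with SAP",
    "x_2": "GPT-4 without SAP",
    "x_3": "GPT-3.5 with SAP",
    "x_4": "GPT-3.5 without SAP",
    "x_5": "Claude-2 with SAP",
    "x_6": "Claude-2 without SAP",
}

def decode_ranking(line):
    """Translate a ranking line such as 'x_1 > x_3 > x_5' into model names.

    The '>'-separated format is a hypothetical example of a ranking line.
    """
    return [LABELS[tok.strip()] for tok in line.split(">")]

print(decode_ranking("x_1 > x_3 > x_5"))
# ['GPT-4 with SAP', 'GPT-3.5 with SAP', 'Claude-2 with SAP']
```

The four-model result files described below use the same x_i convention with different mappings, so only the LABELS dictionary needs to change.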
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is GPT-3.5 with SAP, and x_4 is GPT-3.5 without SAP.
The results of the ranking test by GPT-4 are here and by Claude-2 here.
In the result txt file, x_1 is GPT-4 with SAP, x_2 is GPT-4 without SAP, x_3 is Claude-2 with SAP, and x_4 is Claude-2 without SAP.
Only the GPT-4 evaluation was recorded.
The results of the ranking test for Claude-2 are here. In the result txt file, x_3 is Claude-2 without SAP and x_4 is Claude-2 with SAP.
The results of the ranking test for GPT-3.5 are here. In the result txt file, x_3 is GPT-3.5 without SAP and x_4 is GPT-3.5 with SAP.
The results of the ranking test for GPT-4 are here. In the result txt file, x_3 is GPT-4 without SAP and x_4 is GPT-4 with SAP.
The results of ablation study 1 evaluated by Claude-2 are here. In the result txt file, x_3 is GPT-4 without SAP and x_4 is GPT-4 with SAP.
Evaluated only by GPT-4.
The result of ablation study 2 here
In the result txt file, x_3 is GPT-4 with Zero_shot_COT and x_4 is GPT-4 with SAP.
The result of ablation study 2 here
In the result txt file, x_3 is GPT-4 with EP05 and x_4 is GPT-4 with SAP.
The result of ablation study 2 here
In the result txt file, x_3 is GPT-4 with EP09 and x_4 is GPT-4 with SAP.
@article{wang&zhong2024SAP_LLM,
title={LLM-SAP: Large Language Model Situational Awareness Based Planning},
author={Liman Wang and Hanyang Zhong},
year={2024},
eprint={2312.16127},
archivePrefix={arXiv},
primaryClass={cs.AI}
}