Separation of execution and planning into different agents #3593

Closed
1 task done
dschonholtz opened this issue Apr 30, 2023 · 11 comments
Labels
enhancement New feature or request Stale

Comments

@dschonholtz
Contributor

dschonholtz commented Apr 30, 2023

Duplicates

  • I have searched the existing issues

Summary 💡

The core problem is that the plan must be inferred from past work via memories, and it is not clear to the user or the agent how much of the plan a given command actually accomplished. Ideally, this would be explicit.

So, given a task from the user (or a list of goals), the task planner agent will make a list of the next tasks to do, along with their completion status.

Then we will store these tasks in a queue, pop off the first one whose dependencies are all done, and give it to an execution agent.
The execution agent should eventually have a minified command set tailored to the problem, to reduce unnecessary and incorrect commands, but for now it will just execute with the agent currently used in the main execution loop.
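As a rough sketch of what that queue could look like (the `Task` class and `pop_next_ready` helper are hypothetical names for illustration, not existing Auto-GPT code):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Task:
    task_id: int
    description: str
    command: str                     # command name the planner expects to be used
    depends_on: List[int] = field(default_factory=list)
    done: bool = False

def pop_next_ready(tasks: List[Task]) -> Optional[Task]:
    """Return the first task (tasks assumed priority-ordered) whose dependencies are all done."""
    finished = {t.task_id for t in tasks if t.done}
    for task in tasks:
        if not task.done and all(dep in finished for dep in task.depends_on):
            return task
    return None
```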

Concretely, here is how this looks from an architecture perspective.

The user enters a task they wish the agent to do.

We add a new class called TaskManagerLLM.
As we enter chat.py, the objective and the current task list (if there is one) are fed into TaskManagerLLM, along with an empty initial result.
TaskManagerLLM creates up to 7 tasks it should do to accomplish the objective. It numbers them by priority and marks dependencies, where dependencies are based on a linked-list pointer structure. It also records the specific command that should be used for each task, but not the parameters to that command.

Its prompt should be very similar to the existing execution agent's, with the following components:
Thoughts:
Reasoning:
Criticism:
Speak:
Task List (JSON)
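As a purely illustrative example, the Task List JSON returned by the planner might look something like this (shown as a Python literal; the field names and commands are assumptions, not a settled schema):

```python
# Hypothetical Task List JSON, shown as a Python literal for readability.
task_list = {
    "tasks": [
        {
            "id": 1,
            "description": "Search the web for recent benchmark results",
            "command": "google",          # command name only, no parameters yet
            "depends_on": [],             # pointers to prerequisite task ids
            "status": "pending",
        },
        {
            "id": 2,
            "description": "Summarize the findings into a report file",
            "command": "write_to_file",
            "depends_on": [1],
            "status": "pending",
        },
    ]
}
```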

We then extract that JSON using our existing JSON-parsing toolset, sort the created tasks that have no unresolved dependencies, pick the top one, and give that task to an execution agent.

An execution agent will function the same way that an existing agent does. It will make a plan for its (by comparison) very simple task.
The execution agent will do a one-shot evaluation with the "smart" model to generate the given command with the correct params.
To support this, we will first generate relevant context for it by making one additional LLM call that takes in the output from the planning agent, the goal, the selected task, and the results from previously completed tasks, and outputs a prompt that succinctly describes the problem the agent is solving by executing that command.
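A minimal sketch of that context-building step, with hypothetical helper and parameter names (the real prompt template would be tuned separately):

```python
def build_execution_prompt(goal: str, plan_output: str, task: dict,
                           completed_results: list) -> str:
    """Condense what the execution agent needs into one short, focused prompt."""
    previous = "\n".join(completed_results[-3:])  # only the most recent results
    return (
        f"Overall goal: {goal}\n"
        f"Planner context: {plan_output}\n"
        f"Results of previously completed tasks:\n{previous}\n"
        f"Current task: {task['description']}\n"
        f"Use the command '{task['command']}' and supply the correct parameters."
    )
```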

The biggest potential problem with this is making it work with summarization-based or vector-based memory.

For now, each completed task would get added to memory with the reasoning associated with it.

It is possible that the above is overly complex, and all we need is an ordered, command-based task list in the planning step plus its summarization memory. I might implement that first and see how it does.

Examples 🌈

One example of a specific implementation can be found here: https://github.com/yoheinakajima/babyagi/blob/main/classic/BabyBeeAGI

Motivation 🔦

We have a lot of performance degradation for a few reasons. I, and a few of the other people doing benchmarks, have realized that it isn't useful to benchmark at this point because the results mostly just show how unreliable the agent is. There is no point in counting tokens, for instance, if the agent only finishes the simplest tasks 50% of the time.

A lot of the problems with memory and elsewhere come down to the fact that we carry around a lot of random material that isn't relevant to the current task; we don't really plan well for what the next task should be, nor do we really process what our previous tasks accomplished. And we spend a lot of precious token real estate on content that mostly just confuses the agent.

The basic thought is: if you have one agent that maintains a list of tasks done, their results, and what tasks should be accomplished next, and a separate execution agent that takes that output and executes on it for a simple task, you should get far better performance.

Eventually, I would want to make all of the details for each task queryable, so that if similar tasks have been executed before we could feed that into the execution agent, but for now I want to keep the initial PR simple.
So my hope is to work on this, and then circle back around and show we can actually finish some benchmarks reliably.

@dschonholtz
Contributor Author

My plan is to start working on this later tonight and tomorrow. I would like feedback, but let's discuss on Discord.

@Garr-tt

Garr-tt commented Apr 30, 2023

Feels like this is definitely needed for React development; I would love to test it for you.

@samuelbutler

samuelbutler commented May 1, 2023

The linked example doesn't have syntax highlighting, so here is a gist of BabyBeeAGI with syntax highlighting:
https://gist.github.com/samuelbutler/deccce61cd825170b3afcc31dd63fbd8

@Boostrix
Contributor

Boostrix commented May 3, 2023

I have some thoughts related to this and have begun making early experiments, with pretty good results, too:

[screenshot: planning]

The basic idea is to use aggressively-tuned prompts to make the LLM respond with lists of steps, which are in turn recursively split up and serialized to distinct JSON files to track "progress per step" (see #822 and #430)

In pseudo code terms, the idea is to provide a new command called "plan_task" which takes a detailed <description> of the task, as well as <relevant constraints>

In app.py, I am using start_agent to start a new sub-agent, whose task is to take the description/constraints of the task and come up with the corresponding list of steps, including an estimate of the complexity (percentage of the whole task).

The list is then re-ordered based on logical order of steps.

Up to this point, the LLM is handling all the difficult stuff, so we're talking about 20 lines of code at most.

I am then hashing each step on the list to put that info into a JSON file (to track progress).

For each step, I am also creating a dedicated JSON file that contains the actual task at hand, as well as a status/progress field (to track progress).

The idea being, the "plan_task" command could check for existing JSON files inside the workspace (based on the hash) and use those to continue its work.

Once all steps are serialized to disk, we can edit all step files that have a difficulty/complexity higher than some threshold of, say, 20%. This is to "strategize" and come up with alternatives, so that there is a fallback plan (which would also be passed to the LLM and saved inside the JSON file). These alternatives/options are re-ordered to prioritize those that have a low complexity and a high (estimated) probability of success.

So, whenever one step on the list is at or above 20% difficulty (difficulty being a measure of complexity and of the time/resources, aka constraints (#3466), needed to fulfill this step), I hand over that particular step to another agent to let it come up with 3 alternatives (this could probably be made configurable, like the percentage). If it fails to come up with alternatives, it is encouraged to do online research/web browsing and gather docs+examples (I am testing this in a coding context).

These alternatives go into the step's json file, so that the agent trying to complete this step can alternate between these steps/options.
(If an alternative is used, the parent agent needs to be informed so that it can adapt if necessary.)
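A rough sketch of what one of those per-step JSON files could look like on disk (the file naming, hashing, and field names are my assumptions, not the exact format used in these experiments):

```python
import hashlib
import json
from pathlib import Path

def write_step_file(workspace: Path, step: str, complexity: float,
                    alternatives: list) -> Path:
    """Serialize a single step plus its fallback options so a later run can resume."""
    step_hash = hashlib.sha256(step.encode()).hexdigest()[:16]
    payload = {
        "step": step,
        "complexity": complexity,      # rough percentage of the whole task
        "status": "pending",           # e.g. pending | in_progress | done | failed
        "alternatives": alternatives,  # fallback approaches, most promising first
    }
    path = workspace / f"step_{step_hash}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```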

Since I haven't had much luck executing multiple tasks directly, I tend to favor agents.
So it makes sense to provide contextual information as part of the JSON file for each step. That would be a high-level description of the parent task, as well as requirements shared by the multiple options for executing a step.

This description could also be used when restarting an agent/task [list] to "prime" the agent and compare/evaluate the results of the previous run (basically a form of regression testing/self-check). The agent would encounter the corresponding entries in the JSON file: the step it was supposed to do, a few alternatives, and possibly some progress/state info, which should enable the new agent instance to evaluate the current state of things (if only to save some API tokens by using those for context).

The goal here being to come up with a simple planning framework that is step based, and that can be interrupted/continued, to resume its work later on.

For the time being, I can provide a high level description to create a project with 2-3 files and the planning stage will come up with a list of steps, which can then be sequentially tackled (everything is sequential for now, no parallelism).
The idea is to be able to call a planning/strategy routine that will try different approaches to accomplish a certain task, with the option of suspending/resuming operation.

Ideally, this setup would work for several tasks and could recursively call itself.
At some point, potentially in a concurrent fashion.

One option might be making a dedicated plugin for this, unless this is something that should better be part of the system?
I suppose it would make sense to talk about the various ideas here ...

Any thoughts / ideas ?

Mentioning #3644 and #2409 for the sake of completeness.

@Boostrix
Contributor

Boostrix commented May 4, 2023

I've made some more progress by adding more fields to the thought dictionary (i.e. the JSON), specifically:

  • OBSERVATION - to be filled in by the LLM regarding any observations that would require changing the plan
  • EXPECTATION - to be filled in by the LLM regarding what's going to happen next
  • CONTINGENCY - to be filled in by the LLM to have a contingency in place

This alone actually helps the LLM to "see" problems it previously wasn't able to handle, i.e. I am completing the same objectives using fewer steps (aka less "thinking").
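In Auto-GPT terms, that would mean extending the JSON response format along these lines (a sketch; the example values are invented):

```python
# Sketch of an extended "thoughts" dictionary with the three new fields.
thoughts = {
    "text": "I need to fetch the library docs before writing code.",
    "reasoning": "The API changed recently, so cached knowledge may be stale.",
    "plan": "- fetch docs\n- write module\n- run tests",
    "criticism": "I may be over-fetching; limit this to the relevant pages.",
    "speak": "Fetching the latest documentation first.",
    # New fields:
    "observation": "The previous download returned a 404.",    # anything that should change the plan
    "expectation": "The mirror URL should return the docs.",   # what is expected to happen next
    "contingency": "If the mirror also fails, search for an archived copy.",
}
```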

[screenshot: observe]

A lot of the problems with memory and elsewhere come down to the fact that we carry around a lot of random material that isn't relevant to the current task; we don't really plan well for what the next task should be, nor do we really process what our previous tasks accomplished. And we spend a lot of precious token real estate on content that mostly just confuses the agent.

That's an exceptionally good (and honest) summary. In other PRs, people have begun using completely fresh agent sessions for unrelated tasks such as input validation and applying ethical safeguards (#2701), and it is also my personal observation that prompts tend to be overly "contaminated" with pointless stuff, while in other places we're lacking relevant context, such as during the execution of commands: #2987 (comment)

In other words, it does absolutely make sense to consider using "fresh" contexts and prime those "on demand" with what's essential for the task at hand.

The basic thought is: if you have one agent that maintains a list of tasks done, their results, and what tasks should be accomplished next, and a separate execution agent that takes that output and executes on it for a simple task, you should get far better performance.

That is also my observation, and it seems generally the recommendation around here is to favor "separate agents over separate tasks", exactly for these reasons.

Eventually, I would want to make all of the details for each task queryable, so that if similar tasks have been executed before we could feed that into the execution agent, but for now I want to keep the initial PR simple.
So my hope is to work on this, and then circle back around and show we can actually finish some benchmarks reliably.

I have been playing with contextual memory by hashing agent actions + params and using that as a key in a Python dict. The basic idea being: if the agent keeps repeating the same commands [actions] over and over again using the same args, it's obviously not progressing, especially if it keeps getting the same response/result, which we can use to set/increment a counter and detect that we're probably inside a useless loop: #3668
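A minimal sketch of that kind of loop detection (the threshold and names here are arbitrary):

```python
import hashlib
import json
from collections import Counter

_action_counts = Counter()

def looks_like_a_loop(command: str, args: dict, result: str, threshold: int = 3) -> bool:
    """Return True once the same command/args/result combination keeps recurring."""
    key = hashlib.sha256(
        json.dumps([command, args, result], sort_keys=True).encode()
    ).hexdigest()
    _action_counts[key] += 1
    return _action_counts[key] >= threshold
```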

In a sense, a number of seemingly unrelated issues could be unified this way, because they're all primarily due to a lack of planning and a lack of preparing actions before executing them.

@anonhostpi

Possibly related rearch work: #3790

@dschonholtz
Contributor Author

This is great. I am trying to work on challenges/benchmarking, so if someone wants to take a crack at this, by all means go for it, as I don't think anyone is actively working on the code. My guess is we should hold off until the re-architecting work is done.

@Boostrix
Contributor

Boostrix commented May 6, 2023

FWIW, when I played with this, I thought that untangling the mess that mutual dependencies between tasks introduce (and exploring potential concurrency opportunities) would quickly go haywire, so I tinkered with the idea of using pymake and having the LLM generate a corresponding Makefile equivalent to be able to continue "building" the plan.

Turns out, that approach can be improved upon, too.
The following is an article about using an LLM to translate a plain text plan into actual markup for a planning framework, and then translating that back to human language again: https://www.marktechpost.com/2023/04/30/this-ai-paper-introduces-llmp-the-first-framework-that-incorporates-the-strengths-of-classical-planners-into-llms/

I haven't yet looked in detail at existing libs/frameworks available in Python, but this approach sounds rather interesting and like much less work in comparison to reinventing the wheel:

  • PANDA: PANDA (Python Algorithm for Navigation and Data-Association) is a Python library for multi-target tracking and planning. It provides a framework for representing objects and their trajectories, as well as algorithms for predicting future trajectories and planning paths to intercept or avoid them.
  • PyPlanner: PyPlanner is a Python library for automated planning and scheduling. It provides a way to represent tasks and their dependencies, as well as algorithms for planning and executing tasks. PyPlanner is exclusively written in Python.
  • AIMA: AIMA (Artificial Intelligence: A Modern Approach) is a widely used textbook on artificial intelligence that includes Python code examples and implementations of various AI algorithms, including planning algorithms. The book and accompanying code are available online for free.
  • PYSAT: PYSAT (Python Satellite Data Analysis Toolkit) is a Python library for working with satellite data. It includes a planning module that provides algorithms for scheduling and optimizing satellite observations.
  • PySMT: PySMT is a Python library for working with Satisfiability Modulo Theories (SMT) problems. It includes a planning module that provides algorithms for solving planning problems using SMT solvers.

The point being, with that number of Python-based options, we could just use an adapter/bridge to ask the LLM to translate a set of objectives into a plan for the corresponding planning tool, and then provide 5+ options to evaluate how well these work. Basically, treating the whole thing like a benchmark for now; adding 5+ dependencies in a dev branch should be a no-brainer, and then we'll see which of these is capable of translating the most plans into actionable markup.

Personally, I would be more inclined to accept depending on an external solution - since we're already depending on OpenAI anyway.

Preliminary testing shows that GPT can translate custom plain text plans into code (markup being Python code) for these two Python frameworks:

  • Pyhop is an implementation of the Hierarchical Task Network (HTN) planning framework that provides a way to represent tasks and their dependencies, as well as methods for planning and executing tasks (see the sketch after this list).
  • PDDL (Planning Domain Definition Language) is a standard language for expressing planning problems. PDDLPy is a Python library that provides PDDL parsing and planning capabilities.
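To make the Pyhop direction concrete, here is a minimal HTN sketch using the classic Pyhop API (declare_operators, declare_methods, pyhop); the toy domain itself is invented, and the idea would be to ask the LLM to emit something of this shape from a plain-text plan:

```python
import pyhop

# Operators: primitive actions that modify the state (return the new state or False).
def write_file(state, name):
    state.files.append(name)
    return state

def run_tests(state):
    if state.files:                # tests only make sense once some code exists
        state.tested = True
        return state
    return False

pyhop.declare_operators(write_file, run_tests)

# Method: decomposes the abstract "build_project" task into primitive steps.
def build_project(state, filenames):
    return [("write_file", n) for n in filenames] + [("run_tests",)]

pyhop.declare_methods("build_project", build_project)

state = pyhop.State("project")
state.files, state.tested = [], False

# Produces an ordered list of primitive actions, e.g.
# [('write_file', 'main.py'), ('write_file', 'utils.py'), ('run_tests',)]
plan = pyhop.pyhop(state, [("build_project", ["main.py", "utils.py"])], verbose=1)
```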

So, these could be the lowest hanging fruits for now.

Between Pyhop and PDDLPy, it seems that Pyhop is the more actively maintained and developed framework.

Pyhop has a more recent release, with the most recent version (v1.2) being released in 2021. Additionally, Pyhop is being actively maintained and developed on its GitHub repository, with the most recent commit being just a few weeks ago as of May 2023. Pyhop also has a larger community of users and contributors.

@Boostrix
Contributor

Boostrix commented May 8, 2023

I've been looking through the code to see what sort of "planning" is currently done, and it seems there really isn't much at all? It seems it's primarily delegated to the LLM in the form of the short bullet list that is part of the prompt?

I've seen a bunch of folks mention plans and task queues here, and that was even discussed in some PRs, but the code itself doesn't even seem to use the slightest notion of a stack/queue to push/pop jobs? Care to elaborate?

Also, any insights as to what exactly is "being planned for the planner" as part of the re-arch?

The basic idea is to define a project using a top-level task file that includes a set of sub-tasks, where each sub-task contains a set of atomic steps. The AI agent can look at the top-level task file to determine the state of the project, and continue by looking up the index of the current sub-task and step. The AI agent can then use the LLM to execute the current atomic step and update the state of the sub-task accordingly. Once all the atomic steps of a sub-task are complete, the sub-task can be marked as complete, and the AI agent can move on to the next sub-task. This process can continue until all sub-tasks are complete, and the project is finished.

To ensure that the AI agent can resume from where it left off, metadata can be stored for each sub-task and atomic step, which can include information such as the current state, the input/output files or data, and any other relevant information. This metadata can be stored in a separate JSON file, along with any other relevant data, such as the paths to input and output files. When the AI agent resumes the project, it can read this metadata and continue from the last completed sub-task and atomic step.
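As an illustration only (the on-disk format is still open), the top-level task file plus per-sub-task metadata could look something like this:

```python
# Hypothetical layout, shown as Python literals rather than raw JSON files.
project = {
    "objective": "Build a small static website",
    "sub_tasks": ["design_layout", "write_html", "write_css"],
    "current_sub_task": 1,                  # index into sub_tasks
}

sub_task_metadata = {
    "name": "write_html",
    "steps": [
        {"description": "Create index.html skeleton", "status": "done"},
        {"description": "Add navigation section", "status": "in_progress"},
        {"description": "Add footer", "status": "pending"},
    ],
    "current_step": 1,                      # resume point after a restart
    "inputs": ["design_layout/notes.md"],   # files/data this sub-task depends on
    "outputs": ["site/index.html"],
}
```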

@github-actions
Contributor

github-actions bot commented Sep 6, 2023

This issue has automatically been marked as stale because it has not had any activity in the last 50 days. You can unstale it by commenting or removing the label. Otherwise, this issue will be closed in 10 days.

@github-actions
Contributor

This issue was closed automatically because it has been stale for 10 days with no activity.

@github-actions github-actions bot closed this as not planned (stale) on Sep 17, 2023