New Agent, Action, Observation Abstraction and with updated Controller #105

Merged (62 commits) on Mar 25, 2024

xingyaoww
Collaborator

No description provided.

return {
    "action_type": self.__class__.__name__,
    "args": self.__dict__
}
Collaborator Author

Each action (optionally) takes in a controller and handles execution by itself.

"""

_registry: Dict[str, Type['Agent']] = {}

def __init__(
    self,
    instruction: str,
    workspace_dir: str,
    model_name: str,
    max_steps: int = 100
Collaborator Author

I got rid of workspace_dir and max_steps since these are handled by the Controller now.

pass

@abstractmethod
def step(self, cmd_mgr: CommandManager) -> Event:
Collaborator Author

I got rid of the add_event method, hoping that we can put all these "added events" into State, which is passed to the agent's .step method. I hope this simplifies the workflow and makes the code easier for people to understand.


@abstractmethod
- def step(self, cmd_mgr: CommandManager) -> Event:
+ def step(self, state: State) -> Action:
Collaborator Author

"state --- agent ---> action" is a commonly used paradigm in RL, I think; conforming to this convention may help people understand what's going on here more easily.

Collaborator

step(state) makes much more sense!
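
For readers new to the convention, here is a minimal sketch of what a `step(state) -> Action` agent could look like. The `State` and `Action` fields and the `ListThenFinishAgent` class are hypothetical stand-ins for illustration, not the classes added in this PR.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    # Hypothetical fields, chosen only for this illustration.
    instruction: str
    history: List[str] = field(default_factory=list)


@dataclass
class Action:
    name: str
    argument: str = ""


class Agent(ABC):
    @abstractmethod
    def step(self, state: State) -> Action:
        """Map the current state to the next action."""


class ListThenFinishAgent(Agent):
    """Toy agent: lists files when it has seen nothing yet, then finishes."""

    def step(self, state: State) -> Action:
        if not state.history:
            return Action("run", "ls")
        return Action("finish")


agent = ListThenFinishAgent()
first = agent.step(State(instruction="inspect the repo"))
second = agent.step(State(instruction="inspect the repo", history=["foo.txt"]))
```

The agent holds no workspace or iteration settings; everything it needs arrives through `state`, which is the point of the signature change discussed above.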

@@ -1,49 +0,0 @@
from opendevin.lib.command_manager import CommandManager
Collaborator Author

I moved this file to a separate folder.

Collaborator Author

I moved this file under controller since I realized that this command_manager is not really used in other places; rather, its invocation mostly stays in the Controller.

That is:

  • Agent: an LLM that takes a State and spits out an Action
  • Controller: responsible for putting State into the Agent, executing its Action (where the command manager is involved), and putting the results back into State

- controller = AgentController(agent, args.directory)
+ controller = AgentController(
+     agent, workdir=args.directory, max_iterations=args.max_iterations
+ )
Collaborator Author

Just moved the directory and max_iterations arguments from AgentCls to Controller.

@xingyaoww xingyaoww requested a review from rbren March 23, 2024 05:54
@xingyaoww
Collaborator Author

Hey @rbren, I tweaked the abstraction a bit; I hope to get your feedback on what you think about this before I move any further (converting the langchains agent and codeact to conform to this). My hope is to make this abstraction simpler and easier for everyone to contribute to (e.g., by conforming to the state-agent->action convention).

def to_dict(self):
    return {
        "action_type": self.__class__.__name__,
        "args": self.__dict__
Collaborator

I recently added a message field for all Actions (Events in the old system). Might be worth keeping

Collaborator Author

Did you merge this one already though?

Collaborator

from ..controller import AgentController

@dataclass
class Action:
Collaborator

Do you see this class replacing Event? Or living alongside it?

I've been using Event to track:

  • Actions
  • Action Outputs (logs, file contents, webpages)
  • User messages
  • Thinking in between actions

It's important for the agent to see a stream of these Events (so it can react appropriately), and for the controller to emit a stream of Events (e.g. to show them in the browser).

Collaborator Author

I am actually thinking of replacing Event with Action:

  • If we want to stream thinking between Actions to the user, we could emit a NoopAction (a pure text message for thinking).
  • If we want to stream information to the Agent, we can do so by updating the State and feeding it into the agent. If an action is still running, the agent would be blocked in the current implementation, so we would not be able to interrupt it with an Event either.

I think there is a trade-off in this decision: the advantage is that we don't need to keep Action and Event as two seemingly similar but different things (which may confuse people); the downside is that the control loop might need to run more frequently, but I think it will get the job done?

Collaborator

the downside would be the control loop might need to run more frequently, but I think it will get the job done?

Ah, so are you thinking the loop would be more like:

  • step 1: run ls
  • step 2: output of ls
  • step 3: read file foo.txt
  • step 4: foo contents

instead of the current loop, which combines actions and their outputs into a single step:

  • step 1 : run ls, output of ls
  • step 2: read file foo.txt, foo contents

Collaborator Author

Yes, it roughly looks like this:

for i in range(max_iter):
    state = ...
    # iter 2: state has the output of ls
    action = agent.step(state)
    # iter 1: action == `run ls`
    # iter 2: action == "read file foo.txt"

    observation = action.run(self) # self == controller
    # iter 1: observation == output of ls
    # iter 2: observation == "content of foo"

    state = state.update(observation)
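
The pseudocode above can be made concrete with a small runnable sketch. Everything here is an assumption for illustration: `ScriptedAgent`, the canned `fake_outputs` table, and appending to `state.history` (rather than calling `state.update`) are stand-ins, not the PR's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Observation:
    content: str


@dataclass
class Action:
    command: str

    def run(self, controller: "Controller") -> Observation:
        # Stand-in for real execution: look the command up in a canned table.
        return Observation(controller.fake_outputs.get(self.command, ""))


@dataclass
class State:
    history: List[Tuple[Action, Observation]] = field(default_factory=list)


class ScriptedAgent:
    """Replays a fixed list of commands, one per step (hypothetical agent)."""

    def __init__(self, commands: List[str]) -> None:
        self.commands = commands

    def step(self, state: State) -> Action:
        return Action(self.commands[len(state.history)])


class Controller:
    def __init__(self, agent: ScriptedAgent, max_iterations: int) -> None:
        self.agent = agent
        self.max_iterations = max_iterations
        self.state = State()
        self.fake_outputs = {"ls": "foo.txt", "cat foo.txt": "content of foo"}

    def run_loop(self) -> State:
        for _ in range(self.max_iterations):
            action = self.agent.step(self.state)  # one agent call per iteration
            observation = action.run(self)        # the controller executes it
            self.state.history.append((action, observation))
        return self.state


final_state = Controller(ScriptedAgent(["ls", "cat foo.txt"]), max_iterations=2).run_loop()
```

Each iteration makes exactly one call to `agent.step`, so the loop can be interrupted between any two steps.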

Collaborator Author

@xingyaoww xingyaoww Mar 23, 2024

It also makes it easier to interrupt the control loop (since it runs at a higher frequency). This frequency also corresponds to the LLM's API calling frequency (one step corresponds to one LLM call), which could make calculating stats easier.

Collaborator

👍 I like this

from typing import Mapping

@dataclass
class State:
Collaborator

I wonder if we should keep the Event/Action history here too

Collaborator Author

Good catch! will add


@dataclass
class State:
    background_commands: Mapping[int, str]
Collaborator

Is str here the command that was run?

Collaborator Author

Yes - as you originally wrote, I guess?

Comment on lines 55 to 59
# NOTE: Background events are now part of the State
# log_events = self.command_manager.get_background_events()
# for event in log_events:
#     for callback in self.callbacks:
#         callback(event)
Collaborator

You have background_commands in the state, but we still need to get the logs for those commands intermittently.

# TODO: Make all these Log Events into State so that we can get rid of the Event all together

action: Action = self.agent.step(self.state)
observation: str = action.run(self)
Collaborator

How does observation get passed back to the agent? In State?

How does it get passed to the client? I was using callback, but open to other options.

I do think we'll want structured data in here, like exit_code or http_status

Collaborator Author

Yes! In State. In the current implementation of the control loop, if I am reading this correctly, even if we stream multiple events to the Agent, the agent would still be blocked on its previous job before it can handle the next event.
Instead of dispatching events to agents in real time (like add_event does), we could just maintain a list of Event/Observation history and pass it all together to the Agent when it has finished its previous action?

Collaborator

Makes sense!

How will the agent collate Actions and Observations? E.g. how does it know the order of:

  • I did this runcommand action
  • Then I saw output X
  • Then I read file foo.txt
  • Then I saw output Y
  • Then user told me to change tactics
  • Then I thought Z

Does state keep a single array of those things?

Collaborator Author

@xingyaoww xingyaoww Mar 23, 2024

Yes -- that's my plan: State just keeps a single array of that history, and we pass it to the agent every turn (i.e., just like the messages argument for the OpenAI API: you have to pass in the complete history for the model to emit the next response).
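
As a rough illustration of the "single array, passed whole every turn" idea, the sketch below keeps one ordered history on `State` and renders it in a chat-messages shape. `HistoryItem`, its `role` values, and `to_messages` are hypothetical names introduced here, not part of the PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HistoryItem:
    role: str      # hypothetical: "agent", "environment", or "user"
    content: str


@dataclass
class State:
    history: List[HistoryItem] = field(default_factory=list)

    def to_messages(self) -> List[Dict[str, str]]:
        """Render the whole history, oldest first, as role/content dicts."""
        role_map = {"agent": "assistant", "environment": "user", "user": "user"}
        return [
            {"role": role_map[item.role], "content": item.content}
            for item in self.history
        ]


state = State()
state.history.append(HistoryItem("agent", "run: ls"))
state.history.append(HistoryItem("environment", "foo.txt"))
state.history.append(HistoryItem("user", "change tactics"))
messages = state.to_messages()
```

Because the list is append-only and ordered, the agent can recover the sequence "did X, saw Y, user said Z" without tracking anything itself.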

Collaborator

OK great! So we have an array like the above--I assume they're all the same class? Are they all Actions?

Collaborator Author

YES! Or just a list of Action and Observation (both could be converted to strings, or we could make Observation a separate class, but I'm still debating at the moment).

Collaborator

Having Action and Observation in the same list might be a little tough, unless they're both subclasses of the same abstraction. Otherwise the client will have to introspect each one to figure out if they're looking at an Action or an Observation.

I do think we should keep the data as structured as possible--converting to strings will lose information and make it harder to craft good prompts.

For example, when an error observation occurs, I might want to add a hint to my prompt, saying "your last command failed with exit code $code. We saw these logs: $logs. Can you fix it?"
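
A small sketch of why structured observations help with prompt crafting: if the exit code stays a separate field, building the hint is a one-liner. The `CmdOutputObservation` shape and the `failure_hint` helper below are assumptions made for this example only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CmdOutputObservation:
    # Hypothetical shape: the exit code is kept apart from the log text.
    command: str
    exit_code: int
    logs: str


def failure_hint(obs: CmdOutputObservation) -> Optional[str]:
    """Return a prompt hint for a failed command, or None on success."""
    if obs.exit_code == 0:
        return None
    return (
        f"Your last command failed with exit code {obs.exit_code}. "
        f"We saw these logs: {obs.logs}. Can you fix it?"
    )


ok = CmdOutputObservation("ls", 0, "foo.txt")
bad = CmdOutputObservation("npm test", 1, "npm: command not found")
hint = failure_hint(bad)
```

Had the observation been flattened to a single string first, the code would need to parse the exit code back out before it could decide whether a hint is needed.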

Collaborator Author

That makes a lot of sense! I'll figure it out :) Will let you know when I've finished this one!

Collaborator

@rbren rbren left a comment

This looks great overall! I like the idea of creating an abstract Action class and subclassing it.

I think you're right that we can probably combine step and add_event. We could add the observation history into State to achieve this. But now the agent needs to keep track of what the newest observations are, instead of getting a stream of them in order. Any ideas on how to make that easier?

Also--how do you see user interrupts getting passed into the agent?

@xingyaoww
Collaborator Author

@rbren Thanks!! This is just an initial draft and by no means the final version -- will try to get this completed (fixed with langchains agent and codeact) before next Monday.

We could add the observation history into State to achieve this. But now the agent needs to keep track of what the newest observations are,

I think we can handle this in the control loop and add an attribute to State, something like observations, which is a list holding Observations in chronological order?

"action": event.__class__.__name__,
"args": event.__dict__,
"message": event.message,
}
Collaborator Author

@rbren I fixed most issues related to the new backend server, and this should be ready for review!
But I could use some help here -- what would the message generally contain? Do we have a list of the different events/actions the front-end supports? Thanks :)

Collaborator

I'm hoping we can use the Action.to_dict method to make this cleaner. We'll need a robust way to serialize/deserialize actions and observations so we can send them back and forth.

Right now, every payload has three fields:

  • action: the ID of the action (could be something passive, like output)
  • args: any extra data about the action. key-value pairs, but something we'll want to standardize. E.g. the run action has a command arg, and the output action has an output arg.
  • message: a human-readable description of this action, to be put in the chat window

But it might make sense to split Actions and Observations with two distinct payloads. Maybe something like:

{
  "action": "run",
  "args": {"command": "ls"},
  "message": "let's check what files are available"
}

and

{
  "observation": "run",
  "content": "foo.txt\nbar.txt",
  "extras": {
    "exit_code": 0
  },
  "message": "I see two text files."
}

Collaborator Author

@xingyaoww xingyaoww Mar 25, 2024

Great suggestion! I totally agree! I probably need to read more about how the front-end uses these events. I can update the PR to do something like this:

{
  "type": "observation OR action",
  "name": "name_of_observation_or_action",
  "arguments": {"kwargs for actions"},
  "content": "TEXT CONTENT FOR OBS, null for action",
  "message": "Human readable description of the action - not necessarily need to be model generated",
  "extras": {
    "EXTRA KEY": "EXTRA Value"
  }
}
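
To make the proposed schema concrete, here is a minimal sketch of a single payload type that covers both actions and observations and round-trips through JSON. The `Payload` class and its method names are hypothetical; only the field names come from the template above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Payload:
    type: str                      # "action" or "observation"
    name: str
    message: str = ""
    arguments: Dict[str, Any] = field(default_factory=dict)
    content: Optional[str] = None  # text for observations, None for actions
    extras: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "Payload":
        return cls(**json.loads(raw))


action = Payload(
    type="action",
    name="run",
    arguments={"command": "ls"},
    message="let's check what files are available",
)
observation = Payload(
    type="observation",
    name="run",
    content="foo.txt\nbar.txt",
    extras={"exit_code": 0},
    message="I see two text files.",
)
round_tripped = Payload.from_json(observation.to_json())
```

A client can branch on the `type` field alone, which avoids inspecting which keys happen to be present.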

Collaborator

@rbren rbren left a comment

This is coming along really well!

The thing I'm most concerned about is serializing and deserializing these actions and observations. That will be really important for client/server communication, and for agent/LLM communication.

Having each type be its own class makes things a lot better in Python, but we'll have to work some magic to get them serialized nicely.

Things to consider:

  • Being consistent with how we serialize different actions will make it easier for clients to consume them (e.g. in JavaScript)
  • Being concise will help conserve token usage (e.g. see my comment on action vs action_type below)

I think you have a good start with the to_dict method on Action--maybe if we put some more logic in there, we can remove the conversions that are currently in the langchains agent and the server

class AgentRecallAction(ExecutableAction):
    query: str

    def run(self, controller: "AgentController") -> AgentRecallObservation:
Collaborator

Are the types supposed to be quoted here?

Collaborator Author

Oh, because AgentController is imported only for type-checking here, we have to quote the type (it will still show up correctly in a Python IDE). If we imported it normally instead of only for type-checking, it would create a circular import that breaks at import time, I think.
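
The pattern being described is the standard `typing.TYPE_CHECKING` guard. Below is a minimal sketch; the `controller` module path is hypothetical, and `run` returns a plain string here only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only, never at runtime, so this import
    # cannot create an import cycle. The module path is hypothetical.
    from controller import AgentController


@dataclass
class AgentRecallAction:
    query: str

    def run(self, controller: "AgentController") -> str:
        # The quoted annotation is stored as a plain string at runtime.
        return f"recall: {self.query}"


annotation = AgentRecallAction.run.__annotations__["controller"]
result = AgentRecallAction("deploy steps").run(controller=None)  # type: ignore[arg-type]
```

At runtime `TYPE_CHECKING` is `False`, so the guarded import never executes, while IDEs and mypy still resolve `AgentController` for the quoted annotation.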

Comment on lines +11 to +12
def run(self, controller: "AgentController") -> "Observation":
    raise NotImplementedError
Collaborator

Why is run a method on the base action, rather than on ExecutableAction?

Collaborator Author

oh good point - i should move this down!

Collaborator Author

@xingyaoww xingyaoww Mar 25, 2024

Oh, I realize the reason I did that is so that Python's static analysis can infer the correct .run method in our control loop, where we expect the Action type as the output of the agent (instead of ExecutableAction). But it shouldn't hurt? .run should not be called on a NotExecutableAction anyway.

elif isinstance(info, CmdKillAction):
    d = {"action": "kill", "args": {"id": info.id}}
elif isinstance(info, BrowseURLAction):
    d = {"action": "browse", "args": {"url": info.url}}
Collaborator

This Action <-> JSON conversion is going to be very important, so it's probably something we want to abstract.

For one, the server and client are going to need to send Actions and Observations back and forth, and will need to serialize them. Ditto for agent <-> LLM interactions.

Could d = info.to_dict() simplify this?
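
One way `info.to_dict()` could replace the `isinstance` chain is to let each subclass declare its own short wire name. This is a sketch under assumptions: the `action` class attribute is a hypothetical addition, and the two subclasses only mirror the fields visible in the snippet above.

```python
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict


@dataclass
class Action:
    action: ClassVar[str] = "base"  # short wire name; hypothetical attribute

    def to_dict(self) -> Dict[str, Any]:
        # asdict() skips ClassVar attributes, so only instance fields become args.
        return {"action": self.action, "args": asdict(self)}


@dataclass
class CmdKillAction(Action):
    id: int
    action: ClassVar[str] = "kill"


@dataclass
class BrowseURLAction(Action):
    url: str
    action: ClassVar[str] = "browse"


payloads = [
    a.to_dict()
    for a in (CmdKillAction(id=3), BrowseURLAction(url="https://example.com"))
]
```

Adding a new action type then means adding one subclass, with no change to the serialization code in the server or the agent.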

Collaborator Author

I think so! But I feel we should probably prioritize getting this abstraction merged, then start a separate PR to rewrite the inner workings of the langchains agent completely to adopt this Action and Observation (and the serialization via to_dict) -- otherwise it will be pretty challenging for me to keep up with all the new commits being pushed to main every day hahah

self.max_iterations = max_iterations
self.workdir = workdir
self.command_manager = CommandManager(workdir)
self.state_updated_info: List[Action | Observation] = []
Collaborator

Here's a thought: what if it's a List[(Action, Observation)]?

We'd need to add a NullObservation for actions that aren't executable. But it makes the list a lot more predictable, and it makes it very clear which observation goes with which action.

It'd also help us avoid all the isinstance logic 😄
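
A minimal sketch of the paired-history idea, assuming simplified `Action` and `Observation` classes invented for this example: every entry is an `(Action, Observation)` tuple, and non-executable actions are paired with a `NullObservation`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Action:
    name: str


@dataclass
class Observation:
    content: str


@dataclass
class NullObservation(Observation):
    """Placeholder paired with actions that produce no output."""

    content: str = ""


History = List[Tuple[Action, Observation]]

history: History = [
    (Action("run ls"), Observation("foo.txt")),
    (Action("think"), NullObservation()),
]

# Every entry unpacks the same way; no isinstance() checks are needed.
transcript = [f"{action.name} -> {obs.content!r}" for action, obs in history]
```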

Collaborator Author

@xingyaoww xingyaoww Mar 25, 2024

I guess one issue we have right now is that the user can actually send their own action to the control loop, which might make Tuple[Action, Observation] challenging. Should the agent always treat that as an observation? Or do we enforce that anything coming from the front-end can only be an Observation (which would simplify things by a ton!!)? I think this is what Devin is doing: the agent does all the work, and the user can just sit back and watch.

Collaborator

Hmm...interesting question.

I'd actually say that the user input should always be an Action. Messages could be something like the think action we currently have. Maybe hint?

But I imagine the user could also run a command to help the agent (e.g. maybe it's struggling to figure out apt-get install npm)

Collaborator

@enyst enyst Mar 25, 2024

FWIW there's some video with Devin struggling with a task, and the user saying "wait, don't do that, do [something else]".

Collaborator

as an aside, I think we'll want to have an API endpoint that returns the history, so the user can see everything happening (at least for debugging purposes).

Giving the items in the history a consistent format will help a bunch there

Collaborator Author

From a research perspective, I guess allowing the user to execute a command in Devin's execution environment naturally makes the agent's environment change from static to dynamic, which might pose a bunch of challenges in both implementations and research. To the best of my knowledge, most agent research today mostly assumes a static environment: very few agent benchmarks are tailored for temporal changes, let alone a dynamic environment.

So I think it may be beneficial for us to focus on the static environment first (the user just watches and talks, but does not execute actions themselves). Maybe after we get everything working smoothly and our agents can solve SWE-Bench issues, we can start studying this problem with a dynamic environment. What do y'all think?

PLUS: I think it would be very nice to keep track of history in the Controller; we can start a new issue to call for help after this PR is merged.

raise NotImplementedError

def to_dict(self):
    return {"action_type": self.__class__.__name__, "args": self.__dict__}
Collaborator

I understand why you made this action_type instead of action, but it might be easier if we just call it action--it saves a token if the dict gets passed to an LLM, and is a little easier to read.

return exit_code, logs.decode('utf-8')

def execute_in_background(self, cmd: str) -> None:
self.log_time = time.time()
result = self.container.exec_run(['/bin/bash', '-c', cmd], socket=True, workdir="/workspace")
result = self.container.exec_run(['su', 'devin', '-c', cmd], socket=True, workdir="/workspace")
self.log_generator = result.output # socket.SocketIO
Collaborator

👍 we'll probably want to rework this at some point. I've noticed it often has issues installing software.

@xingyaoww
Collaborator Author

xingyaoww commented Mar 25, 2024

@rbren Can confirm the latest code now works on the minimal example for both langchains and codeact agents.

[Two screenshots attached]

However, something in here could still break other things (e.g., the front-end). I'd hope we can get this merged soon and open a series of issues to keep track of new problems that may come with the new abstraction (and save me temporarily from trying to solve merge conflicts every day haha), for example:

  • refactor the langchains agent to this new abstraction
  • support parsing front-end observations (e.g., user messages) by adding a from_dict method for Observation (potentially for Action in the future, if we want to make the agent work with the user at the same time)
  • make sure we also change the message key for the front-end to support multi-turn chat
  • remove all the print statements inside individual agent implementations, and do systematic colored action/observation printing in the controller
  • keep track of the entire (action, observation) history in the controller for the front-end to use when needed
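
For the from_dict item in the list above, one possible shape is a lookup table from wire name to class. This is only a sketch: the `name` class attribute, `OBSERVATION_TYPES`, `observation_from_dict`, and both subclasses are hypothetical and do not describe the merged code.

```python
from dataclasses import asdict, dataclass
from typing import Any, ClassVar, Dict, Type


@dataclass
class Observation:
    content: str
    name: ClassVar[str] = "observation"  # hypothetical wire name

    def to_dict(self) -> Dict[str, Any]:
        return {"observation": self.name, **asdict(self)}


@dataclass
class UserMessageObservation(Observation):
    name: ClassVar[str] = "user_message"


@dataclass
class CmdOutputObservation(Observation):
    exit_code: int = 0
    name: ClassVar[str] = "run"


# Explicit table of the types the front-end may send.
OBSERVATION_TYPES: Dict[str, Type[Observation]] = {
    cls.name: cls for cls in (UserMessageObservation, CmdOutputObservation)
}


def observation_from_dict(data: Dict[str, Any]) -> Observation:
    """Rebuild an Observation from its dict form; unknown names raise KeyError."""
    fields = dict(data)  # copy so the caller's dict is left untouched
    cls = OBSERVATION_TYPES[fields.pop("observation")]
    return cls(**fields)


original = CmdOutputObservation(content="foo.txt", exit_code=0)
restored = observation_from_dict(original.to_dict())
user_msg = observation_from_dict({"observation": "user_message", "content": "try npm ci"})
```

Keeping the table explicit means an unexpected payload from the client fails loudly instead of instantiating an arbitrary class.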

@xingyaoww xingyaoww changed the title A draft of new Agent & Action Abstraction and Their Interaction with Controller New Agent, Action, Observation Abstraction and with updated Controller Mar 25, 2024
@xingyaoww xingyaoww merged commit 82f934d into All-Hands-AI:main Mar 25, 2024
@xingyaoww xingyaoww mentioned this pull request Mar 26, 2024
xcodebuild pushed a commit to xcodebuild/OpenDevin that referenced this pull request Mar 31, 2024
…ll-Hands-AI#105)

* rearrange workspace_dir and max_step as arguments to controller

* remove unused output

* abstract each action into dataclass

* move actions

* fix action import

* move cmd manager and change method to private

* move controller

* rename action folder

* add state

* a draft of Controller & new agent abstraction

* add agent actions

* remove controller file

* add observation to perform a refractor on langchains agent

* revert to make this compatible via translation

* fix typo and translate error

* add error to observation

* index thought as dict

* refractor controller

* fix circular dependency caused by type hint

* add runnable attribute to agent

* add mixin to denote executable

* change baseclass

* make file read/write action compatible w/ docker directory

* remove event

* fix some merge issue

* fix sandbox w/ permission issue

* cleanup history abstraction since langchains agent is not really using it

* tweak to make langchains agent working

* make all actions return observation

* fix missing import

* add echo action for agent

* add error code to cmd output obs

* make cmd manager returns cmd output obs

* fix codeact agent to make it work

* fix all ruff issue

* fix mypy

* add import agenthub back

* add message for Action attribute (migrate from previous event)

* fix typo

* fix instruction setting

* fix instruction setting

* attempt to fix session

* ruff fix

* add .to_dict method for base and observation

* add message for recall

* try to simplify the state_updated_info with tuple of action and obs

* update_info to Tuple[Action, Observation]

* make codeact agent and langchains compatible with Tuple[Action, Observation]

* fix ruff

* fix ruff

* change to base path to fix minimal langchains agent

* add NullAction to potentially handle for chat scenario

* Update opendevin/controller/command_manager.py

Co-authored-by: Robert Brennan <[email protected]>

* fix event args

* set the default workspace to "workspace"

* make directory relative (so it does not show up to agent in File*Action)

* fix typo

* await to yield for sending observation

* fix message format

---------

Co-authored-by: Robert Brennan <[email protected]>

3 participants