Universe environments differ from other Gym environments in that the environment keeps running in real time, even when the agent does not call `step`. This has a few important implications:
- Actions and observations can no longer be considered to occur on a "clock tick".
- An explicit call to `reset` is asynchronous and returns immediately, even though the environment has not yet finished resetting. (If you would prefer the `reset` call to block until the reset has finished, you can wrap the client-side environment with a `BlockingReset` wrapper; see the sketch after this list.)
- Since the environment will not have waited to finish connecting to the VNC server before returning, the initial return values from `reset` will be `None` to indicate that there is not yet a valid observation.
- An agent that successfully learns from a Universe environment cannot take "thinking breaks": it must keep sending actions to the environment at all times.
- Lag and latency play a major role in your agent's ability to learn successfully in a given environment. The latency and profiling numbers returned in the `info` dictionary can provide important information for training.
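For example, to make `reset` block until a real observation is available, you can wrap the environment before configuring it. A minimal sketch, assuming the `BlockingReset` wrapper is importable from `universe.wrappers` and using `flashgames.DuskDrive-v0` as an illustrative environment id:

```python
import gym
import universe  # importing universe registers the Universe environments
from universe.wrappers import BlockingReset

env = gym.make('flashgames.DuskDrive-v0')  # illustrative environment id
env = BlockingReset(env)     # reset() will now block until a real observation exists
env.configure(remotes=1)
observation_n = env.reset()  # blocks; no None placeholder observations
```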
The vectorized Gym API allows a single client-side environment to control a vector of remotes. The main difference from the non-vectorized Gym API is that individual environments will automatically reset upon reaching the end of an episode. (An episode is defined as ending when the agent has concretely succeeded or failed at the task, such as after clearing a level of a game or losing the game. Some environments without clearly delineated success and failure conditions may not have episodes.)
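Concretely, controlling several remotes from a single client-side environment might look like the following sketch; the environment id and the remote count are chosen for illustration:

```python
import gym
import universe  # importing universe registers the Universe environments

env = gym.make('flashgames.DuskDrive-v0')  # illustrative environment id
env.configure(remotes=2)     # one client-side env driving two remote environments
observation_n = env.reset()  # a vector with one entry per remote
assert len(observation_n) == 2
```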
There are two API methods, `reset` and `step`. The semantics are:

`reset` takes no arguments and returns a vector of observations:

```python
observation_n = env.reset()
```
`step` consumes a vector of actions and returns a vector of observations, a vector of rewards, a vector of done booleans, and an info dictionary. The info dictionary has an `n` key, which contains a vector of infos specific to each env:

```python
observation_n, reward_n, done_n, info = env.step(action_n)
# len(info['n']) == len(observation_n)
```
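Putting the two methods together, a minimal agent loop might look like the following sketch; the environment id and the hard-coded `KeyEvent` action are illustrative placeholders for a real policy:

```python
import gym
import universe  # importing universe registers the Universe environments

env = gym.make('flashgames.DuskDrive-v0')  # illustrative environment id
env.configure(remotes=1)
observation_n = env.reset()

while True:
    # One action per remote; each action is a list of VNC events.
    action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n]
    observation_n, reward_n, done_n, info = env.step(action_n)
    assert len(info['n']) == len(observation_n)  # one info dict per env
```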
Some important notes:
- At any given moment, some of the environments may be resetting. Resetting environments will have a `None` value for their observation. For example, an `observation_n` of `[None, {'vision': ...}, {'vision': ...}]` indicates that the environment at index 0 is resetting.
- When an index returns `done=True`, the corresponding environment will automatically start resetting.
- The user must call `reset` once before calling `step`; undefined behavior will result if `reset` is not called. Further `reset` calls are allowed, but are generally used only if the environment has been idle for a while (such as during periodic evaluation), or when it is important to start at the beginning of an episode. In practice, the stepping loop must tolerate `None` observations; see the sketch after this list.
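One way to handle resetting environments is to mask out `None` observations and send a no-op wherever an environment is mid-reset. A sketch under those assumptions, where `policy` is a hypothetical stand-in for a learned policy and the empty event list serves as a no-op action:

```python
import gym
import universe  # importing universe registers the Universe environments

def policy(ob):
    # Hypothetical stand-in for a learned policy: always press the up arrow.
    return [('KeyEvent', 'ArrowUp', True)]

def choose_actions(observation_n):
    # Send a no-op (empty event list) to any environment that is still resetting.
    return [[] if ob is None else policy(ob) for ob in observation_n]

env = gym.make('flashgames.DuskDrive-v0')  # illustrative environment id
env.configure(remotes=2)
observation_n = env.reset()  # may contain None entries while remotes connect

while True:
    observation_n, reward_n, done_n, info = env.step(choose_actions(observation_n))
    # done_n[i] == True means env i has begun resetting automatically;
    # its observations will be None until its reset completes.
```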
The remote is versioned and has fixed semantics, assuming sufficient compute resources are applied (e.g. if you don't have enough CPU, your flash environments will likely behave differently). The client's exact semantics will depend on the version of universe you have installed, so you should track that version together with the rest of your agent code.
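One simple way to track it is to record the installed `universe` package version in your run metadata at startup; a minimal sketch using `pkg_resources` from setuptools:

```python
import pkg_resources

# Record the client version so a run can be reproduced against the same semantics.
universe_version = pkg_resources.get_distribution('universe').version
print('universe client version:', universe_version)
```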