
Add Algorithmic Environment documentation #2334

41 changes: 41 additions & 0 deletions docs/algorithmic.md
@@ -0,0 +1,41 @@
# Algorithmic Environments

The unique dependencies for this set of environments can be installed via:

```bash
pip install gym[algorithmic]
```

### Characteristics

Algorithmic environments have the following traits in common:
- A 1-d "input tape" or 2-d "input grid" of characters
- A target string which is a deterministic function of the input characters

Agents control a read head that moves over the input tape. Observations consist
of the single character currently under the read head. The read head may fall
off the end of the tape in any direction. When this happens, agents will observe
a special blank character (with index `env.base`) until the head moves back within bounds.

### Actions
Actions consist of 3 sub-actions:
- Direction to move the read head (left or right, plus up and down for 2-d
envs)
- Whether to write to the output tape
- Which character to write (ignored if the above sub-action is 0)

An episode ends when:
- The agent writes the full target string to the output tape.
- The agent writes an incorrect character.
- The agent runs out of time. (The time limit is fairly conservative.)

Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0

In the beginning, input strings will be fairly short. After an environment has
been consistently solved over some window of episodes, the environment will
increase the average length of generated strings. Typical env specs require
leveling up many times to reach their reward threshold.
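
The snippet below is a minimal interaction sketch for one of these environments. It assumes the classic `gym` API, in which `reset()` returns an observation and `step()` returns `(observation, reward, done, info)`; `Copy-v0` is used purely as an example.

```python
import gym

env = gym.make('Copy-v0')
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # Each action is a 3-tuple: (move direction, write flag, character to write).
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print(total_reward)
```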
43 changes: 43 additions & 0 deletions docs/algorithmic/copy.md
@@ -0,0 +1,43 @@
Copy
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Copy|Discrete|(3,)|[(0, 1),(0,1),(0,<a href="#base">base</a>-1)]|(1,)|(0,<a href="#base">base</a>)| |from gym.envs.algorithmic import copy_|
---

This task involves copying content from the input tape to the output tape. This task was originally used in the paper <a href="http://arxiv.org/abs/1511.07275">Learning Simple Algorithms from Examples</a>.

The model has to learn:
- the correspondence between input and output symbols.
- how to execute the move-right action on the input tape.

The agent takes a 3-element vector as its action.
The action space is `(x, w, v)`, where:
- `x` is used for left/right movement. It can take values (0,1).
- `w` is used to decide whether to write to the output tape. It can take values (0,1).
- `v` is the value to be written to the output tape.


The observation space has shape `(1,)`.
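
For illustration, a hand-coded policy that writes the character it currently observes and then moves right solves this task. The sketch below assumes the classic `gym` API (where `step()` returns `(observation, reward, done, info)`) and that the write sub-action is applied before the read head moves.

```python
import gym

# Hand-coded policy sketch for Copy-v0: at every step, write the observed
# character and move right. The episode ends once the full target is written.
env = gym.make('Copy-v0', base=5)
obs = env.reset()
done = False
while not done:
    # action = (move right, write, character to write)
    obs, reward, done, info = env.step((1, 1, obs))
env.close()
```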

**Rewards:**

Rewards are issued similarly to other algorithmic environments. Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0

### Arguments

```
gym.make('Copy-v0', base=5, chars=True)
```

<a id="base">`base`</a>: Number of distinct characters to read/write.

`chars`: If True, use uppercase alphabet. Otherwise, digits. Only affects rendering.

### Version History

* v0: Initial version release (1.0.0)
43 changes: 43 additions & 0 deletions docs/algorithmic/duplicated_input.md
@@ -0,0 +1,43 @@
Duplicated Input
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Duplicated Input|Discrete|(3,)|[(0, 1),(0,1),(0,<a href="#base">base</a>-1)]|(1,)|(0,<a href="#base">base</a>)| |from gym.envs.algorithmic import duplicated_input|
---

The task is to return every nth (<a href="#dup">duplication</a>) character from the input tape. This task was originally used in the paper <a href="http://arxiv.org/abs/1511.07275">Learning Simple Algorithms from Examples</a>.
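
As a sketch of the target function (assuming the input presents each target character <a href="#dup">duplication</a> times in a row; the helper name below is illustrative, not part of the environment):

```python
# Illustrative helper: keep one character from each run of `duplication`
# repeated input characters.
def duplicated_input_target(input_chars, duplication=2):
    return input_chars[::duplication]

print(duplicated_input_target([3, 3, 1, 1, 0, 0], duplication=2))  # [3, 1, 0]
```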

The model has to learn:
- the correspondence between input and output symbols.
- how to execute the move-right action on the input tape.

The agent takes a 3-element vector as its action.
The action space is `(x, w, v)`, where:
- `x` is used for left/right movement. It can take values (0,1).
- `w` is used to decide whether to write to the output tape. It can take values (0,1).
- `v` is the value to be written to the output tape.


The observation space has shape `(1,)`.

**Rewards:**

Rewards are issued similarly to other algorithmic environments. Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0

### Arguments

```
gym.make('DuplicatedInput-v0', base=5, duplication=2)
```

<a id="base">`base`</a>: Number of distinct characters to read/write.

<a id="dup">`duplication`</a>: Number of consecutive identical input characters that correspond to a single output character.

### Version History

* v0: Initial version release (1.0.0)
47 changes: 47 additions & 0 deletions docs/algorithmic/repeat_copy.md
@@ -0,0 +1,47 @@
Repeat Copy
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Repeat Copy|Discrete|(3,)|[(0, 1),(0,1),(0,<a href="#base">base</a>-1)]|(1,)|(0,<a href="#base">base</a>)| |from gym.envs.algorithmic import repeat_copy|
---



This task involves copying content from the input tape to the output tape in normal order, then in reverse order, then in normal order again; for example, for input [x1 x2 … xk] the required output is [x1 x2 … xk xk … x2 x1 x1 x2 … xk]. This task was originally used in the paper <a href="http://arxiv.org/abs/1511.07275">Learning Simple Algorithms from Examples</a>.
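
A small sketch of the target function implied by the description (the helper name is illustrative, not part of the environment):

```python
# Illustrative helper: the required output is the input, then its reverse,
# then the input again.
def repeat_copy_target(input_chars):
    return input_chars + input_chars[::-1] + input_chars

print(repeat_copy_target([0, 1, 2]))  # [0, 1, 2, 2, 1, 0, 0, 1, 2]
```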

The model has to learn:
- the correspondence between input and output symbols.
- how to execute the move-left and move-right actions on the input tape.

The agent takes a 3-element vector as its action.
The action space is `(x, w, v)`, where:
- `x` is used for left/right movement. It can take values (0,1).
- `w` is used to decide whether to write to the output tape. It can take values (0,1).
- `v` is the value to be written to the output tape.


The observation space has shape `(1,)`.

**Rewards:**

Rewards are issued similarly to other algorithmic environments. Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0



### Arguments

```
gym.make('RepeatCopy-v0', base=5)
```

<a id="base">`base`</a>: Number of distinct characters to read/write.



### Version History

* v0: Initial version release (1.0.0)
41 changes: 41 additions & 0 deletions docs/algorithmic/reverse.md
@@ -0,0 +1,41 @@
Reverse
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Reverse|Discrete|(3,)|[(0, 1),(0,1),(0,<a href="#base">base</a>-1)]|(1,)|(0,<a href="#base">base</a>)| |from gym.envs.algorithmic import reverse|
---

The goal is to reverse a sequence of symbols on the input tape. We provide a special character `r` to indicate the end of the sequence. The model must learn to move right multiple times until it hits the `r` symbol, then move to the left, copying the symbols to the output tape. This task was originally used in the paper <a href="http://arxiv.org/abs/1511.07275">Learning Simple Algorithms from Examples</a>.

The model has to learn:
- the correspondence between input and output symbols.
- how to execute the move-left and move-right actions on the input tape.

The agent takes a 3-element vector as its action.
The action space is `(x, w, v)`, where:
- `x` is used for left/right movement. It can take values (0,1).
- `w` is used to decide whether to write to the output tape. It can take values (0,1).
- `v` is the value to be written to the output tape.


The observation space has shape `(1,)`.

**Rewards:**

Rewards are issued similarly to other algorithmic environments. Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0

### Arguments

```
gym.make('Reverse-v0', base=2)
```

<a id="base">`base`</a>: Number of distinct characters to read/write.

### Version History

* v0: Initial version release (1.0.0)
46 changes: 46 additions & 0 deletions docs/algorithmic/reversed_addition.md
@@ -0,0 +1,46 @@
Reversed Addition
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Reversed Addition|Discrete|(3,)|[(0,1,2,3),(0,1),(0,<a href="#base">base</a>-1)]|(1,)|(0,<a href="#base">base</a>)| |from gym.envs.algorithmic import reversed_addition|
---

The goal is to add <a href="#rows">`rows`</a> multi-digit numbers provided on an input grid. The numbers are given in <a href="#rows">`rows`</a> adjacent rows of the grid, with their right edges aligned. The initial position of the read head is the last digit of the top number (i.e. the upper-right corner). This task was originally used in the paper <a href="http://arxiv.org/abs/1511.07275">Learning Simple Algorithms from Examples</a>.

The model has to:
- memorize an addition table for pairs of digits.
- learn how to move over the input grid.
- discover the concept of a carry.
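
As an illustration of the carry logic the agent must discover, here is a sketch of the target computation, assuming the digits are summed column by column with the least-significant column first (the helper name is illustrative, not part of the environment):

```python
# Illustrative helper: sum the grid column by column with a carry, emitting
# one output digit per column (least-significant digit first).
def reversed_addition_target(columns, base):
    carry, target = 0, []
    for digits in columns:  # each column holds `rows` digits
        total = sum(digits) + carry
        target.append(total % base)
        carry = total // base
    if carry:
        target.append(carry)
    return target

# e.g. adding two base-3 numbers given column-wise:
print(reversed_addition_target([(1, 2), (2, 2), (0, 1)], base=3))  # [0, 2, 2]
```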

The agent takes a 3-element vector as its action.
The action space is `(x, w, v)`, where:
- `x` is used for the direction of movement. It can take values (0,1,2,3).
- `w` is used to decide whether to write to the output tape. It can take values (0,1).
- `v` is the value to be written to the output tape.


The observation space has shape `(1,)`.

**Rewards:**

Rewards are issued similarly to other algorithmic environments. Reward schedule:
- write a correct character: +1
- write a wrong character: -0.5
- run out the clock: -1
- otherwise: 0

### Arguments

```
gym.make('ReversedAddition-v0', rows=2, base=3)   # for ReversedAddition
gym.make('ReversedAddition3-v0', rows=3, base=3)  # for ReversedAddition3
gym.make('ReversedAddition-v0', rows=n, base=3)   # for ReversedAddition with n numbers
```

<a id="rows">`rows`</a>: Number of multi-digit sequences to add at a time.

<a id="base">`base`</a>: Number of distinct characters to read/write.

### Version History

* v0: Initial version release (1.0.0)
60 changes: 60 additions & 0 deletions docs/toy_text/blackjack.md
@@ -0,0 +1,60 @@
Blackjack
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Blackjack|Discrete|(1,)|(0,1)|(3,)|[(0,31),(0,10),(0,1)]| |from gym.envs.toy_text import blackjack|
---

Blackjack is a card game in which the goal is to obtain cards whose sum is as close to 21 as possible without going over. The player plays against a fixed dealer.

Card Values:

- Face cards (Jack, Queen, King) have a point value of 10.
- Aces can count as either 11 or 1; an ace counted as 11 is called a 'usable ace'.
- Numerical cards (2-9) have a value equal to their number.

This game is played with an infinite deck (i.e. cards are drawn with replacement).
The game starts with the dealer having one face-up and one face-down card, while the player has two face-up cards.

The player can request additional cards (hit, action=1) until they decide to stop
(stick, action=0) or exceed 21 (bust).
After the player sticks, the dealer reveals their facedown card, and draws
until their sum is 17 or greater. If the dealer goes bust the player wins.
If neither player nor dealer busts, the outcome (win, lose, draw) is
decided by whose sum is closer to 21.

The agent takes a 1-element vector as its action.
The action space is `(action)`, where:
- `action` is used to decide stick (0) or hit (1).

The observation is a 3-tuple of: the player's current sum,
the dealer's one showing card (1-10, where 1 is an ace), and whether or not the player holds a usable ace (0 or 1).
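
The sketch below runs one episode under a naive threshold policy (hit below 17, otherwise stick). It assumes the classic `gym` API, in which `step()` returns `(observation, reward, done, info)`.

```python
import gym

env = gym.make('Blackjack-v0')
obs = env.reset()
done = False
while not done:
    player_sum, dealer_card, usable_ace = obs
    action = 1 if player_sum < 17 else 0  # 1 = hit, 0 = stick
    obs, reward, done, info = env.step(action)
env.close()
print(reward)  # +1 win, -1 loss, 0 draw
```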

This environment corresponds to the version of the blackjack problem
described in Example 5.1 in Reinforcement Learning: An Introduction
by Sutton and Barto.
http://incompleteideas.net/book/the-book-2nd.html

**Rewards:**

Reward schedule:
- win game: +1
- lose game: -1
- draw game: 0
- win game with natural blackjack:
  - +1.5 (if <a href="#nat">natural</a> is True)
  - +1 (if <a href="#nat">natural</a> is False)

### Arguments

```
gym.make('Blackjack-v0', natural=False)
```

<a id="nat">`natural`</a>: Whether to give an additional reward for starting with a natural blackjack, i.e. starting with an ace and ten (sum is 21).

### Version History

* v0: Initial version release (1.0.0)
70 changes: 70 additions & 0 deletions docs/toy_text/frozen_lake.md
@@ -0,0 +1,70 @@
Frozen Lake
---
|Title|Action Type|Action Shape|Action Values|Observation Shape|Observation Values|Average Total Reward|Import|
| ----------- | -----------| ----------- | -----------| ----------- | -----------| ----------- | -----------|
|Frozen Lake|Discrete|(1,)|(0,3)|(1,)|(0,nrows*ncolumns)| |from gym.envs.toy_text import frozen_lake|
---


Frozen Lake involves crossing a frozen lake from the start (S) to the goal (G) without falling into any holes (H). The agent may not always move in the intended direction due to the slippery nature of the frozen lake.

The agent takes a 1-element vector as its action.
The action space is `(dir)`, where `dir` decides the direction to move in, which can be:

- 0: LEFT
- 1: DOWN
- 2: RIGHT
- 3: UP

The observation is a value representing the agent's current position as

    current_row * ncols + current_col

**Rewards:**

Reward schedule:
- Reach goal(G): +1
- Reach hole(H): 0

### Arguments

```
gym.make('FrozenLake-v0', desc=None, map_name="4x4", is_slippery=True)
```

`desc`: Used to specify custom map for frozen lake. For example,

desc=["SFFF", "FHFH", "FFFH", "HFFG"].

`map_name`: ID to use any of the preloaded maps.

"4x4":[
"SFFF",
"FHFH",
"FFFH",
"HFFG"
]

"8x8": [
"SFFFFFFF",
"FFFFFFFF",
"FFFHFFFF",
"FFFFFHFF",
"FFFHFFFF",
"FHHFFFHF",
"FHFFHFHF",
"FFFHFFFG",
]




`is_slippery`: True/False. If True, the agent will move in the intended direction with probability 1/3; otherwise it will move in either perpendicular direction with probability 1/3 each.

For example, if the action is left and is_slippery is True, then:
- P(move left)=1/3
- P(move up)=1/3
- P(move down)=1/3
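
A minimal usage sketch with a custom map (assumes the classic `gym` API, in which `step()` returns `(observation, reward, done, info)`):

```python
import gym

custom_map = ["SFFF",
              "FHFH",
              "FFFH",
              "HFFG"]
env = gym.make('FrozenLake-v0', desc=custom_map, is_slippery=True)
obs = env.reset()  # integer index: current_row * ncols + current_col
done = False
while not done:
    action = env.action_space.sample()  # 0 = LEFT, 1 = DOWN, 2 = RIGHT, 3 = UP
    obs, reward, done, info = env.step(action)
env.close()
print(reward)  # 1 if the goal was reached, otherwise 0
```
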
### Version History

* v0: Initial version release (1.0.0)