Introduction
In this problem you will implement value iteration on a 10 × 10 gridworld, given its state space, actions, and rewards. Consider the 10 × 10 gridworld (GridWorld-1) shown in Figure 1:
Figure 1: GridWorld-1
- State space: The gridworld has 100 distinct states. There is a special absorbing state called Goal. There are two wormholes labeled IN, colored grey and brown; any action taken in those states teleports you to the state labeled “OUT” of the same color (grey and brown, respectively). States labeled OUT are normal states.
- Action space: There are 4 actions, A = {North, East, South, West}, each of which moves you one cell in the corresponding direction.
- Transition model: The gridworld is stochastic. In this model, an action X ∈ {North, East, South, West} moves you one cell in the X direction from your current position with probability 0.8, and with probability 0.1 each moves you one cell at angles of 90° and −90° to the direction X (refer to Table 1 for more details). For example, if the selected action is North, you transition one cell to the North of your current position with probability 0.8, one cell to the East with probability 0.1, and one cell to the West with probability 0.1. Transitions that would take you off the grid do not result in any state change. There are no transitions available once you reach the Goal state.
Table 1
- Rewards: You receive a reward of −1 for all transitions (including those that would take you off the grid) except transitions to the Goal state. Any transition to the Goal state gives you a reward of +100.
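The transition model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: states are represented as hypothetical `(row, col)` tuples with row 0 as the northmost row, the helper names `move` and `transition_distribution` are made up for this sketch, and the wormholes and the Goal state are left out because their positions come from the figure and the data files.

```python
# Sketch of the stochastic transition model (wormholes and Goal omitted).
GRID_SIZE = 10

# (row, col) offsets; row 0 is ASSUMED to be the northmost row.
MOVES = {"North": (-1, 0), "East": (0, 1), "South": (1, 0), "West": (0, -1)}
# Clockwise order, used to look up the two perpendicular directions.
CLOCKWISE = ["North", "East", "South", "West"]


def move(state, direction):
    """Deterministic one-cell move; off-grid moves leave the state unchanged."""
    row, col = state
    d_row, d_col = MOVES[direction]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE:
        return (new_row, new_col)
    return state


def transition_distribution(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    idx = CLOCKWISE.index(action)
    side_a = CLOCKWISE[(idx + 1) % 4]  # one perpendicular direction
    side_b = CLOCKWISE[(idx - 1) % 4]  # the other perpendicular direction
    dist = {}
    for direction, prob in ((action, 0.8), (side_a, 0.1), (side_b, 0.1)):
        nxt = move(state, direction)
        # Outcomes that land on the same cell (e.g. at a wall) are merged.
        dist[nxt] = dist.get(nxt, 0.0) + prob
    return dist
```

For an interior cell such as `(5, 5)`, calling `transition_distribution((5, 5), "North")` spreads the probability over the cells to the North, East, and West. In a corner, the off-grid outcomes fold back onto the current cell, so the probabilities still sum to 1.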
Instructions
- Implement:
  - Value iteration. Find the pseudocode below. Let S be the state space and A the action space:
  - A greedy policy w.r.t. J_i as,
- The value iteration loop runs indefinitely (refer to the pseudocode given above), so when would you stop your value iteration?
- Plot a graph of max_{s ∈ S} |J_{i+1}(s) − J_i(s)| vs. iterations.
- Tabulate the values of J(s) and the greedy policy π(s), ∀s ∈ S, after 10 iterations, after 25 iterations, and after you stop the value iteration.
- Consider a new gridworld (GridWorld-2) as shown in Figure 2. Compare and contrast the behavior of J and the greedy policy π for GridWorld-1 and GridWorld-2.
Figure 2: GridWorld-2
You will write your solutions and make your submission through a notebook, following the instructions in the starter code.
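As a rough illustration of the tasks listed above, here is a minimal sketch of value iteration and greedy policy extraction on a generic tabular model. It is not the starter code and may differ from the course pseudocode. Everything here is an assumption for illustration: the discount factor `gamma`, the tolerance `tol`, the `transitions[s][a] = [(prob, next_state, reward), ...]` layout, the convention that an absorbing state has no transitions and value 0, and the tiny three-state toy problem. Stopping when max_{s ∈ S} |J_{i+1}(s) − J_i(s)| falls below `tol` is one common choice, not necessarily the expected answer.

```python
def backup(J, s, a, transitions, gamma):
    """Expected one-step return of action a in state s under values J."""
    return sum(p * (r + gamma * J[s2]) for p, s2, r in transitions[s][a])


def value_iteration(states, actions, transitions, gamma=0.9, tol=1e-6,
                    max_iters=10_000):
    """Return (J, deltas); deltas[i] is the max-norm change at iteration i."""
    J = {s: 0.0 for s in states}
    deltas = []
    for _ in range(max_iters):
        J_next = {}
        for s in states:
            if not transitions.get(s):  # absorbing state: value stays 0
                J_next[s] = 0.0
            else:
                J_next[s] = max(backup(J, s, a, transitions, gamma)
                                for a in actions)
        delta = max(abs(J_next[s] - J[s]) for s in states)
        deltas.append(delta)
        J = J_next
        if delta < tol:  # ASSUMED stopping rule: max-norm change below tol
            break
    return J, deltas


def greedy_policy(J, states, actions, transitions, gamma=0.9):
    """Pick, in each non-absorbing state, an action maximizing the backup."""
    policy = {}
    for s in states:
        if not transitions.get(s):
            policy[s] = None  # no action needed in an absorbing state
        else:
            policy[s] = max(
                actions, key=lambda a: backup(J, s, a, transitions, gamma))
    return policy


# Hypothetical toy problem (not the assignment grid): A -> B -> Goal.
toy_states = ["A", "B", "Goal"]
toy_actions = ["stay", "go"]
toy_transitions = {
    "A": {"stay": [(1.0, "A", -1)], "go": [(1.0, "B", -1)]},
    "B": {"stay": [(1.0, "B", -1)], "go": [(1.0, "Goal", 100)]},
    "Goal": {},  # absorbing: no transitions available
}
J, deltas = value_iteration(toy_states, toy_actions, toy_transitions)
policy = greedy_policy(J, toy_states, toy_actions, toy_transitions)
```

The `deltas` list holds exactly the quantity asked for in the plot, one entry per iteration, so it can be passed directly to a plotting routine. Values after a fixed number of iterations (for example 10 or 25) could be obtained by setting `max_iters` accordingly and `tol` to 0.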
Files
Under the Resources section you will find data files that contain the environment parameters for this problem.
Submission
Submissions will be made through a notebook following the instructions in the starter code.
Contact
- RL TAs