[Starter Notebook] RL - Taxi Problem

This is a getting-started notebook for the Taxi Problem in the RL course.

S.Rathi

This is a starter notebook for the Taxi Problem. It contains basic instructions for using the notebook to make submissions, as well as the tasks to perform and the questions to answer. Please read the instructions carefully before proceeding. You must create your own copy of the notebook before you start working with it.

Happy Solving!😀

AIcrowd

What is the notebook about?

Problem - DP Algorithm

This problem deals with a taxi driver who can choose among several actions in each of several cities. Your tasks are:

  • Implement a DP algorithm to find the optimal sequence of actions for the taxi driver
  • Find optimal policies for sequences of varying lengths
  • Explain a variation on the policy

How to use this notebook? 📝

  • This is a shared template, and any edits you make here will not be saved. Make a copy in your own drive: click the "File" menu (top-left), then "Save a Copy in Drive". You can then edit your copy however you like.

  • Update the config parameters. You can define the common variables here
Variable | Description
AICROWD_DATASET_PATH | Path to the file containing test data. This should be an absolute path.
AICROWD_RESULTS_DIR | Path to write the output to.
AICROWD_ASSETS_DIR | If your notebook needs additional files (model weights, etc.), add them to a directory and specify the path to that directory here (as a relative path). The contents of this directory will be sent to AIcrowd for evaluation.
AICROWD_API_KEY | To submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me

Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [ ]:
!pip install -U git+https://gitlab.aicrowd.com/aicrowd/aicrowd-cli.git@notebook-submission-v2 > /dev/null
In [ ]:
%load_ext aicrowd.magic

AIcrowd Runtime Configuration 🧷

Define configuration parameters. Please include any files needed for the notebook to run under ASSETS_DIR. We will copy the contents of this directory to your final submission file 🙂

In [ ]:
import os

AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", os.getcwd()+"/40746340-4151-4921-8496-be10b3f8f5cf_hw2_q1.zip")
AICROWD_RESULTS_DIR = os.getenv("OUTPUTS_DIR", "results")
API_KEY = "" #Get your API key from https://www.aicrowd.com/participants/me

Download dataset files 📲

In [ ]:
!aicrowd login --api-key $API_KEY
!aicrowd dataset download -c rl-taxi
In [ ]:
!unzip -q $AICROWD_DATASET_PATH
In [ ]:
DATASET_DIR = 'hw2_q1/'
!mkdir {DATASET_DIR}results/

Install packages 🗃

Please add all package installations in this section.

In [ ]:

Import packages 💻

In [ ]:
import numpy as np
import os
# ADD ANY IMPORTS YOU WANT HERE
In [ ]:
import numpy as np

class TaxiEnv_HW2:
    def __init__(self, states, actions, probabilities, rewards):
        self.possible_states = states
        self._possible_actions = {st: ac for st, ac in zip(states, actions)}
        self._ride_probabilities = {st: pr for st, pr in zip(states, probabilities)}
        self._ride_rewards = {st: rw for st, rw in zip(states, rewards)}
        self._verify()

    def _check_state(self, state):
        assert state in self.possible_states, "State %s is not a valid state" % state

    def _verify(self):
        """ 
        Verify that data conditions are met:
        The probability and reward arrays of each state have shape (number of actions, number of states)
        Every probability distribution adds up to 1
        """
        ns = len(self.possible_states)
        for state in self.possible_states:
            ac = self._possible_actions[state]
            na = len(ac)

            rp = self._ride_probabilities[state]
            assert np.all(rp.shape == (na, ns)), "Probabilities shape mismatch"
        
            rr = self._ride_rewards[state]
            assert np.all(rr.shape == (na, ns)), "Rewards shape mismatch"

            assert np.allclose(rp.sum(axis=1), 1), "Probabilities don't add up to 1"

    def possible_actions(self, state):
        """ Return all possible actions from a given state """
        self._check_state(state)
        return self._possible_actions[state]

    def ride_probabilities(self, state, action):
        """ 
        Returns the ride probabilities from a state for a given action.
        The result is an array with one value per next state, in the same order as self.possible_states
        """
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_probabilities[state][ac_idx]

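    def ride_rewards(self, state, action):
        """
        Returns the ride rewards from a state for a given action.
        The result is an array with one value per next state, in the same order as self.possible_states
        """
        actions = self.possible_actions(state)
        ac_idx = actions.index(action)
        return self._ride_rewards[state][ac_idx]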

Examples of using the environment functions

In [ ]:
def check_taxienv():
    # These are the values as used in the pdf, but they may be changed during submission, so do not hardcode anything

    states = ['A', 'B', 'C']

    actions = [['1','2','3'], ['1','2'], ['1','2','3']]

    probs = [np.array([[1/2,  1/4,  1/4],
                    [1/16, 3/4,  3/16],
                    [1/4,  1/8,  5/8]]),

            np.array([[1/2,   0,     1/2],
                    [1/16,  7/8,  1/16]]),

            np.array([[1/4,  1/4,  1/2],
                    [1/8,  3/4,  1/8],
                    [3/4,  1/16, 3/16]]),]

    rewards = [np.array([[10,  4,  8],
                        [ 8,  2,  4],
                        [ 4,  6,  4]]),
    
            np.array([[14,  0, 18],
                        [ 8, 16,  8]]),
    
            np.array([[10,  2,  8],
                        [6,   4,  2],
                        [4,   0,  8]]),]


    env = TaxiEnv_HW2(states, actions, probs, rewards)
    print("All possible states", env.possible_states)
    print("All possible actions from state B", env.possible_actions('B'))
    print("Ride probabilities from state A with action 2", env.ride_probabilities('A', '2'))
    print("Ride rewards from state C with action 3", env.ride_rewards('C', '3'))

check_taxienv()
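
As a small illustration of how the two tables combine, the sketch below computes the expected one-step reward for state A under action 1, using the example numbers from the cell above. The variable names are illustrative only and are not part of the notebook's API.

```python
import numpy as np

# Row for state 'A', action '1' in the example tables above.
probs_a1 = np.array([1/2, 1/4, 1/4])   # probability of ending in A, B, C
rewards_a1 = np.array([10, 4, 8])      # reward for ending in A, B, C

# Expected one-step reward: sum over next states of probability * reward.
expected_reward = float(np.dot(probs_a1, rewards_a1))
print(expected_reward)  # 0.5*10 + 0.25*4 + 0.25*8 = 8.0
```

With a TaxiEnv_HW2 instance, the same two rows would come from env.ride_probabilities('A', '1') and env.ride_rewards('A', '1').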

Task 1 - DP Algorithm implementation

Implement your DP algorithm so that, for each starting state and each sequence length up to N = 10, it returns the expected reward and the corresponding policy.
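
The core of such an algorithm is a single backup step. The sketch below shows one possible form of that step, assuming an undiscounted finite-horizon problem in which the reward is collected on each transition and the value after the last ride is zero; check these assumptions against the problem statement. The function name backup and its argument layout are hypothetical and work on plain lists and arrays rather than on the environment class.

```python
import numpy as np

def backup(states, actions, probs, rewards, prev_values):
    """One DP backup: best value and action per state, given the values
    of the remaining rounds (prev_values)."""
    v_prev = np.array([prev_values[s] for s in states], dtype=float)
    values, policy = {}, {}
    for s, acts, p, r in zip(states, actions, probs, rewards):
        # q[a] = sum_j p[a, j] * (r[a, j] + v_prev[j])
        q = (p * (r + v_prev)).sum(axis=1)
        best = int(np.argmax(q))  # ties resolve to the first action
        values[s] = float(q[best])
        policy[s] = acts[best]
    return values, policy

# Example data from the notebook (assumed here only for illustration).
states = ['A', 'B', 'C']
actions = [['1', '2', '3'], ['1', '2'], ['1', '2', '3']]
probs = [np.array([[1/2, 1/4, 1/4], [1/16, 3/4, 3/16], [1/4, 1/8, 5/8]]),
         np.array([[1/2, 0, 1/2], [1/16, 7/8, 1/16]]),
         np.array([[1/4, 1/4, 1/2], [1/8, 3/4, 1/8], [3/4, 1/16, 3/16]])]
rewards = [np.array([[10, 4, 8], [8, 2, 4], [4, 6, 4]]),
           np.array([[14, 0, 18], [8, 16, 8]]),
           np.array([[10, 2, 8], [6, 4, 2], [4, 0, 8]])]

zero = {s: 0.0 for s in states}
values_1, policy_1 = backup(states, actions, probs, rewards, zero)
print(values_1)  # {'A': 8.0, 'B': 16.0, 'C': 7.0}
print(policy_1)  # {'A': '1', 'B': '1', 'C': '1'}
```

Repeating this step, feeding each round's values into the next, would give the values for longer sequences; how the rounds are stored in the results is left to the template below.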

In [ ]:
def dp_solve(taxienv):
    ## Implement the DP algorithm for the taxienv
    states = taxienv.possible_states
    values = {s: 0 for s in states}
    policy = {s: '0' for s in states}
    all_values = [] # Append the "values" dictionary to this after each update
    all_policies = [] # Append the "policy" dictionary to this after each update
    # Note: The sequence length is always N=10
    
    # ADD YOUR CODE BELOW - DO NOT EDIT ABOVE THIS LINE

    # DO NOT EDIT BELOW THIS LINE
    results = {"Expected Reward": all_values, "Polcies": all_policies}
    return results

Here is an example of what the "results" output from the dp_solve function should look like.

Of course, it won't be all zeros.

{'Expected Reward': [{'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0},
  {'A': 0, 'B': 0, 'C': 0}],
 'Polcies': [{'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'},
  {'A': '0', 'B': '0', 'C': '0'}]}
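
A quick structural check can catch a malformed output before submission. The sketch below is a hypothetical helper (check_results is not part of the notebook or of the AIcrowd tooling) that assumes the layout shown above: two keys, each holding one dictionary per round, with one entry per state.

```python
def check_results(results, states, n_rounds=10):
    """Return a list of problems found in a dp_solve-style results dict."""
    problems = []
    for key in ("Expected Reward", "Polcies"):  # key spelling as used by the notebook
        rounds = results.get(key)
        if rounds is None:
            problems.append(f"missing key {key!r}")
            continue
        if len(rounds) != n_rounds:
            problems.append(f"{key!r} has {len(rounds)} entries, expected {n_rounds}")
        for i, entry in enumerate(rounds):
            if set(entry) != set(states):
                problems.append(f"{key!r}[{i}] does not cover all states")
    return problems

placeholder = {
    "Expected Reward": [{"A": 0, "B": 0, "C": 0} for _ in range(10)],
    "Polcies": [{"A": "0", "B": "0", "C": "0"} for _ in range(10)],
}
print(check_results(placeholder, ["A", "B", "C"]))  # an empty list means no problems found
```

One Python detail worth keeping in mind: if the same dictionary object is updated in place and appended after every round, all list entries refer to that one object. Appending a copy, for example dict(values), keeps each round's snapshot separate.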
In [ ]:
if not os.path.exists(AICROWD_RESULTS_DIR):
  os.mkdir(AICROWD_RESULTS_DIR)
In [ ]:
# DO NOT EDIT THIS CELL, DURING EVALUATION THE DATASET DIR WILL CHANGE
input_dir = os.path.join(DATASET_DIR, 'inputs')
for params_file in os.listdir(input_dir):
  kwargs = np.load(os.path.join(input_dir, params_file), allow_pickle=True).item()

  env = TaxiEnv_HW2(**kwargs)

  results = dp_solve(env)
  idx = params_file.split('_')[-1][:-4]
  np.save(os.path.join(AICROWD_RESULTS_DIR, 'results_' + idx), results)
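
To make the file handling in this cell concrete, the sketch below saves and reloads a parameter dictionary the same way. The file name params_0.npy and the tiny two-state parameters are hypothetical; only the dictionary keys are taken from the TaxiEnv_HW2 constructor arguments.

```python
import os
import tempfile
import numpy as np

# Hypothetical parameters; the keys mirror the TaxiEnv_HW2 constructor arguments.
params = {
    "states": ["A", "B"],
    "actions": [["1"], ["1"]],
    "probabilities": [np.array([[0.5, 0.5]]), np.array([[1.0, 0.0]])],
    "rewards": [np.array([[1, 2]]), np.array([[3, 4]])],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "params_0.npy")  # hypothetical file name
    np.save(path, params)  # the dict is stored as a pickled 0-d object array
    loaded = np.load(path, allow_pickle=True).item()  # .item() unwraps the dict
    idx = os.path.basename(path).split("_")[-1][:-4]  # drops the ".npy" suffix

print(sorted(loaded), idx)
```

Because np.save appends ".npy" when the name lacks it, saving to 'results_' + idx produces a file named results_<idx>.npy.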
In [ ]:
## Modify this code to show the results for the policy and expected rewards properly
print(results)
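
One way to present the results more readably is a plain-text table with one row per list entry. The sketch below is only a possible layout; format_results is a hypothetical helper, and the row number is simply the position in the lists, so how it maps to the rounds of the problem depends on the order in which you appended them.

```python
def format_results(results, states):
    """Build one text row per list entry: value and action for every state."""
    header = "round | " + " | ".join(f"V({s})  a({s})" for s in states)
    lines = [header]
    rounds = zip(results["Expected Reward"], results["Polcies"])
    for n, (values, policy) in enumerate(rounds, start=1):
        cells = " | ".join(f"{values[s]:5.2f}  {policy[s]:>4}" for s in states)
        lines.append(f"{n:5d} | {cells}")
    return "\n".join(lines)

# Placeholder data in the same layout as the dp_solve output.
demo = {
    "Expected Reward": [{"A": 0.0, "B": 0.0}, {"A": 0.0, "B": 0.0}],
    "Polcies": [{"A": "0", "B": "0"}, {"A": "0", "B": "0"}],
}
table = format_results(demo, ["A", "B"])
print(table)
```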

Task 2 - Tabulate the optimal policy & optimal value for each state in each round for N=10

Modify this cell and add your answer

Question - Consider a policy that always forces the driver to go to the nearest taxi stand, irrespective of the state. Is it optimal? Justify your answer.

Modify this cell and add your answer

Submit to AIcrowd 🚀

NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)

In [ ]:
!DATASET_PATH=$AICROWD_DATASET_PATH aicrowd notebook submit -c rl-taxi -a assets
In [ ]:

