🕵️ Introduction
Consider the problem of a taxi driver who serves six cities, 0, 1, 2, 3, 4 and 5, located on a circular highway. The taxi driver can choose one of the following actions:
1. Cruise the streets looking for a passenger.
2. Go to the nearest taxi stand and wait in line.
3. Head back to headquarters (city 0).
Passengers only travel to the preceding or succeeding city, and the driver never cancels a ride. For a given city and either of the first two actions, there is a probability that the driver gets a passenger and moves to the succeeding or preceding city, and a probability that the driver gets no passenger and stays in the same city. Refer to the table for the transition probabilities $p^k_{ij}$, which are written as $p^k_c$ where $c = j - i$ (taken modulo 6 on the circular highway, so $c \in \{-1, 0, 1\}$).
The rewards for outcomes $c = -1$ and $c = 1$ under actions 1 and 2 are shown in the figure. When the driver doesn't get a passenger in a city ($c = 0$), the reward is 0. The reward for heading back to headquarters is also 0.
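The model described above can be turned into arrays for dynamic programming. The following is a minimal sketch, not the starter notebook's code: `build_model` is a hypothetical helper, actions are 0-indexed (0 = cruise, 1 = taxi stand, 2 = headquarters), and it assumes the transition probabilities depend only on the action and the offset $c$ (as the $p^k_c$ notation suggests) while rewards may vary by city. It also assumes that heading to headquarters moves the taxi deterministically to city 0. The numbers at the bottom are placeholders, not the challenge data.

```python
import numpy as np

N_CITIES = 6
OFFSETS = (-1, 0, 1)  # c = j - i: preceding city, no passenger, succeeding city


def build_model(p_c, r_c, n=N_CITIES):
    """Build transition and expected-reward arrays for the taxi MDP.

    Actions are 0-indexed here: 0 = cruise, 1 = taxi stand, 2 = headquarters.
    p_c[k][c]    : probability of offset c under action k (k = 0 or 1)
    r_c[k][i][c] : reward in city i for offset c under action k (0 when c = 0)
    Returns P[a, i, j] (transition probabilities) and R[a, i] (expected reward).
    """
    P = np.zeros((3, n, n))
    R = np.zeros((3, n))
    for k in range(2):  # cruise and taxi stand share the p^k_c structure
        for i in range(n):
            for c in OFFSETS:
                j = (i + c) % n  # circular highway: city n-1 is adjacent to city 0
                P[k, i, j] += p_c[k][c]
                R[k, i] += p_c[k][c] * r_c[k][i][c]
    P[2, :, 0] = 1.0  # ASSUMPTION: headquarters deterministically leads to city 0, reward 0
    return P, R


# Hypothetical placeholder numbers, NOT the values from the challenge data files.
p_c = [{-1: 0.3, 0: 0.2, 1: 0.5}, {-1: 0.4, 0: 0.1, 1: 0.5}]
r_c = [[{-1: 4.0, 0: 0.0, 1: 6.0}] * N_CITIES,
       [{-1: 3.0, 0: 0.0, 1: 5.0}] * N_CITIES]
P, R = build_model(p_c, r_c)
print(R[:, 0])  # expected immediate reward of each action in city 0
```

Storing the expected immediate reward `R[a, i]` rather than per-transition rewards is enough for policy iteration, because only the expectation enters the Bellman equation.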
Suppose 1 − γ is the probability that the taxi breaks down before the next trip. The driver's goal is to maximize the total reward until the taxi breaks down.
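The breakdown probability is what makes γ act as a discount factor. To see this, assume breakdowns happen independently of the driver's city and actions, and let $T$ be the index of the last trip before the breakdown. Trip $t$ takes place only if the taxi survived $t$ earlier breakdown checks, which happens with probability $\gamma^t$. Writing $r^k_{ij}$ for the reward of moving from city $i$ to city $j$ under action $k$ (notation introduced here), the expected total reward and the resulting Bellman equations are:

```latex
\mathbb{E}\Big[\sum_{t=0}^{T} r_t\Big]
  = \sum_{t=0}^{\infty} \Pr(T \ge t)\,\mathbb{E}[r_t]
  = \sum_{t=0}^{\infty} \gamma^{t}\,\mathbb{E}[r_t]

V^{\pi}(i) = \sum_{j} p^{\pi(i)}_{ij}\bigl(r^{\pi(i)}_{ij} + \gamma\,V^{\pi}(j)\bigr)

V^{*}(i) = \max_{k \in \{1,2,3\}} \sum_{j} p^{k}_{ij}\bigl(r^{k}_{ij} + \gamma\,V^{*}(j)\bigr)
```

For actions 1 and 2, $p^k_{ij} = p^k_c$ with $c = j - i$ (mod 6), as defined above.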
📝 Task
Implement the following: (2+1.5 marks)
 Find an optimal policy using policy iteration, starting with a policy that always cruises regardless of the city, and a zero value vector. Let γ = 0.9.
 Run policy iteration for discount factors γ from 0 to 0.95 in steps of 0.05 and display the results.
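Both tasks can be sketched with one policy iteration routine. This is a minimal sketch, not the reference solution. `policy_iteration` is a hypothetical name, and the 3-action × 6-city model built at the top is a placeholder, not the challenge data. To honour the "zero value vector" requirement, the sketch uses iterative policy evaluation starting from $V = 0$ rather than an exact linear solve. During improvement, it only switches an action on a strict gain, which prevents cycling between equally good policies.

```python
import numpy as np

# Hypothetical placeholder model (NOT the challenge data): 3 actions x 6 cities.
# Actions are 0-indexed: 0 = cruise, 1 = taxi stand, 2 = headquarters.
N = 6
P = np.zeros((3, N, N))
R = np.zeros((3, N))
for k, (p_prev, p_stay, p_next, r_prev, r_next) in enumerate(
        [(0.3, 0.2, 0.5, 4.0, 6.0), (0.4, 0.1, 0.5, 3.0, 5.0)]):
    for i in range(N):
        P[k, i, (i - 1) % N] += p_prev
        P[k, i, i] += p_stay
        P[k, i, (i + 1) % N] += p_next
        R[k, i] = p_prev * r_prev + p_next * r_next  # no reward when c = 0
P[2, :, 0] = 1.0  # assumed: headquarters moves to city 0 with reward 0


def policy_iteration(P, R, gamma, policy=None, tol=1e-8, max_sweeps=10_000):
    """Policy iteration with iterative evaluation started from a zero value vector.

    Returns (policy, V, number_of_improvement_steps).
    """
    n_states = P.shape[1]
    idx = np.arange(n_states)
    # Default start: always cruise (action 0) in every city.
    policy = np.zeros(n_states, dtype=int) if policy is None else np.array(policy)
    V = np.zeros(n_states)
    steps = 0
    while True:
        # Policy evaluation: repeat V <- R_pi + gamma * P_pi V until it stops changing.
        P_pi, R_pi = P[policy, idx], R[policy, idx]
        for _ in range(max_sweeps):
            V_new = R_pi + gamma * P_pi @ V
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Policy improvement: Q[a, i] = R[a, i] + gamma * sum_j P[a, i, j] * V[j].
        Q = R + gamma * P @ V
        best = np.argmax(Q, axis=0)
        # Switch only on a strict improvement, so ties cannot make the loop cycle.
        improve = Q[best, idx] > Q[policy, idx] + 1e-9
        steps += 1
        if not improve.any():
            return policy, V, steps
        policy = np.where(improve, best, policy)


pi, V, steps = policy_iteration(P, R, gamma=0.9)
print("gamma=0.90", "policy:", pi + 1, "V:", np.round(V, 2), "steps:", steps)

# Sweep gamma = 0.00, 0.05, ..., 0.95 (20 values).
for g in [round(0.05 * k, 2) for k in range(20)]:
    pi, V, steps = policy_iteration(P, R, gamma=g)
    print(f"gamma={g:.2f} policy={pi + 1} steps={steps}")
```

Printing `pi + 1` maps back to the action numbers 1 to 3 used above. With these placeholder numbers, cruising has the highest expected reward in every city, so the output says nothing about the real data. Swapping in the arrays built from the provided files is what makes the sweep informative.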
Answer the following (based on the data given above): (1+0.5 marks)
 How do different values of γ affect policy iteration? Explain your findings.
 Give alternative transition probabilities for action 2 (if any exist) such that the optimal policy includes action 2. Explain your answer.
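One way to screen candidate probabilities for action 2 is a one-step lookahead, based on the policy improvement step. Given the values $V$ of the current optimal policy, if action 2's lookahead value in some city exceeds that of the current action, the improvement step would switch that city to action 2. Rerunning policy iteration with the candidate probabilities then confirms the final policy. The sketch below assumes the same $p^k_c$ structure. `q_value` is a hypothetical helper, and the value vector and rewards are placeholders.

```python
def q_value(i, p_c, r_c, V, gamma, n=6):
    """One-step lookahead value of an action with offset probabilities p_c in city i.

    p_c[c] : probability of offset c in {-1, 0, 1}; r_c[c] : reward for offset c.
    V      : value vector of the current (e.g. optimal) policy.
    """
    return sum(p * (r_c[c] + gamma * V[(i + c) % n]) for c, p in p_c.items())


# Hypothetical numbers: V is the always-cruise value from a placeholder model.
V = [42.0] * 6
cruise = q_value(0, {-1: 0.3, 0: 0.2, 1: 0.5}, {-1: 4.0, 0: 0.0, 1: 6.0}, V, 0.9)
stand = q_value(0, {-1: 0.0, 0: 0.0, 1: 1.0}, {-1: 3.0, 0: 0.0, 1: 5.0}, V, 0.9)
print(stand > cruise)  # True: this candidate for action 2 beats cruising in city 0
```

When $V$ is the same in every city, the continuation terms are equal for all actions, because each row of probabilities sums to 1. The comparison then reduces to the expected immediate reward $\sum_c p^k_c\, r^k_c$ of each action.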
You will be writing your solutions & making a submission through a notebook. You can follow the instructions in the starter notebook.
💾 Dataset
Under the Resources section, you will find data files that contain the parameters of the environment for this problem.
🚀 Submission
 Submissions will be made through a notebook following the instructions in the starter notebook.
 Each team can make 5 successful and 5 failed submissions per day. Once the failed-submission limit is reached, further submissions count toward the successful-submission limit.
 The submission limit will reset at 5:30 AM IST every day.
 At the end of the challenge, you will have to select one submission as your final one.
📱 Contact
 RL TAs