{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Q-Learning Tutorial - Frozen Lake\n", "Welcome to Q-learning!\n", "Today we'll be going through how to set up and run q-learning using both the Frozen Lake and Taxi examples from OpenAI Gym.\n", "\n", "### Imports" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from IPython import display\n", "from IPython.display import clear_output" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import numpy as np" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import torch" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import gym" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import time" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from gym.envs.registration import register\n", "register(\n", " id='FrozenLakeNotSlippery-v0',\n", " entry_point='gym.envs.toy_text:FrozenLakeEnv',\n", " kwargs={'map_name' : '4x4', 'is_slippery': False},\n", " max_episode_steps=100,\n", " reward_threshold=0.78, # optimum = .8196\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set-up\n", "Here we select which environment to use. (We'll be using FrozenLakeNotSlippery first)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# Select your environment\n", "\n", "env = gym.make('FrozenLakeNotSlippery-v0')\n", "#env = gym.make('FrozenLake-v0')\n", "#env = gym.make('Taxi-v3')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we need to find the number of states and actions possible for this environment." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "number_of_states=env.observation_space.n\n", "number_of_actions=env.action_space.n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have these values, we can use them to initialize our q-table.\n", "You can see here that we're initializing it with all zeroes. During training, a vector of small random values is added to the q-values when an action is selected, which breaks ties between equal entries (the stored q-table itself is not changed)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Initialize q-table\n", "Q=torch.zeros([number_of_states,number_of_actions])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now for our hyper-parameters.\n", "\n", "gamma - the discount factor, which controls how much future reward should affect current decision making" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Set the gamma value\n", "gamma=0.95" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "learning rate - determines to what extent newly acquired information overrides the old q-value (0 keeps the old value, 1 replaces it entirely)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# Set the learning rate\n", "learning_rate=0.9" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "egreedy (epsilon) - probability of choosing a random action (exploration) instead of the best known action (exploitation)\n", "\n", "egreedy_final - minimum epsilon value (at which point it stops decaying)\n", "\n", "egreedy_decay - factor by which epsilon is multiplied after every step" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "# Set the epsilon value\n", "egreedy=0.9\n", "egreedy_final=0.01\n", "egreedy_decay=0.999" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Training\n", "\n", "And now we can set up our training loop!\n", "\n", "Outside the loop, you can see we're setting the number of episodes to 1000. 
(Feel free to play around with this to see the agent's progress after different amounts of training.)\n", "\n", "Action selection will be either random exploration or exploitation of the known Q-table values. The likelihood of each selection type is controlled by the epsilon parameter.\n", "\n", "After an action is chosen, we update the Q-table according to the Bellman equation described last week: $Q(s,a) \\leftarrow (1-\\alpha)\\,Q(s,a) + \\alpha\\,[r + \\gamma \\max_{a'} Q(s',a')]$, where $\\alpha$ is the learning rate and $\\gamma$ is gamma.\n", "\n", "Once the agent has reached the goal for the first time, every 50th episode is animated below the code. A graphic depiction of the Q-table is also shown - as the agent explores more of the environment, this image will change." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Episode finished after: 6\n", "Wall time: 2min 45s\n" ] } ], "source": [ "%%time\n", "\n", "x_e = []\n", "y_e = []\n", "\n", "num_episodes=1000\n", "reach_goal = False\n", "steps_total=np.full([num_episodes],-999,dtype=np.int32)\n", "rewards_total=np.full([num_episodes],-999,dtype=np.float32)\n", "\n", "for i_episode in range(num_episodes):\n", "\n", "    # reset the state for the current episode\n", "    state = env.reset()\n", "\n", "    # keep track of how many steps have been taken in the episode\n", "    step=0\n", "\n", "    # undiscounted reward for the episode\n", "    current_total_reward=0\n", "\n", "    while True:\n", "\n", "        step+=1\n", "\n", "        # small random vector added to q-table values when selecting action (breaks ties)\n", "        Q_eps=1e-6*torch.randn([number_of_actions])\n", "\n", "        random_for_egreedy=torch.rand(1).item()\n", "\n", "        if random_for_egreedy>egreedy:\n", "            # Exploitation - select largest value from q-table\n", "            action=torch.argmax(Q[state]+Q_eps).item()\n", "        else:\n", "            # Exploration - random action selection\n", "            action=env.action_space.sample()\n", "\n", "        new_state, reward, done, info = env.step(action)\n", "\n", "        current_total_reward+=reward\n", "\n", "        if reward>0:\n", "            reach_goal = True\n", "\n", "        # decay epsilon after every step until it reaches its minimum\n", "        if egreedy>egreedy_final:\n", "            egreedy*=egreedy_decay\n", "\n", "        # Update the q-table using the Bellman equation\n", "        Q[state,action]=(1.0-learning_rate)*Q[state,action]+learning_rate*(reward+gamma*torch.max(Q[new_state]).item())\n", "\n", "        state=new_state\n", "        clear_output(wait=True)\n", "\n", "        # Once the goal has been reached, display animation of current episode every 50 episodes\n", "        if reach_goal and (i_episode%50 == 0):\n", "            time.sleep(0.3)\n", "            print(\"------------------------\")\n", "            env.render()\n", "            print(\"++++++++++++++++++++++++\")\n", "            print('new state: ',new_state)\n", "            print('Current reward',current_total_reward)\n", "            plt.figure(figsize=(5,5))\n", "            #print(Q.t())\n", "            plt.imshow(Q, cmap='gray', aspect='auto')\n", "            #plt.colorbar()\n", "            plt.show()\n", "\n", "        if done:\n", "            steps_total[i_episode]=step\n", "            rewards_total[i_episode]=current_total_reward\n", "            print(\"Episode finished after: {}\".format(step))\n", "            time.sleep(0.1)\n", "            break\n", "\n", "x_e.append(torch.arange(len(rewards_total)))\n", "y_e.append(rewards_total)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Results\n", "Let's take a look at the results.\n", "\n", "First, let's look at the average reward value and number of steps from our training run." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average number of steps: 6.437\n" ] } ], "source": [ "print(\"Average number of steps: {}\". format(np.average(steps_total)))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average number of steps in last 100 episodes: 6.12\n" ] } ], "source": [ "print(\"Average number of steps in last 100 episodes: {}\". 
format(np.average(steps_total[-100:])))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average total undiscounted reward 0.8130000233650208\n" ] } ], "source": [ "print(\"Average total undiscounted reward {}\".format(np.average(rewards_total)))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average total undiscounted reward in the last 100 episodes 1.0\n" ] } ], "source": [ "print(\"Average total undiscounted reward in the last 100 episodes {}\".format(np.average(rewards_total[-100:])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can take a look at the q-table now too. (Remember how it started with values of 0?)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[0.7351, 0.7738, 0.6983, 0.7351],\n", "        [0.7351, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.7738, 0.8145, 0.0000, 0.7351],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.8145, 0.0000, 0.8574, 0.7738],\n", "        [0.8145, 0.9025, 0.8145, 0.0000],\n", "        [0.8574, 0.8550, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.0000, 0.0000, 0.0000],\n", "        [0.0000, 0.9025, 0.9500, 0.8574],\n", "        [0.9025, 0.9500, 1.0000, 0.8145],\n", "        [0.0000, 0.0000, 0.0000, 0.0000]])\n" ] } ], "source": [ "print(Q)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like its action-taking has become much clearer: each row is a state and each column an action (0 = left, 1 = down, 2 = right, 3 = up), and following the largest value in each row from state 0 traces the path 0 → 4 → 8 → 9 → 13 → 14 → 15 to the goal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we're going to look at how the total reward varied by episode." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "plt.figure(1,figsize=[12,5])\n", "plt.title(\"Total undiscounted reward per episode\")\n", "plt.plot(torch.arange(len(rewards_total)), rewards_total,alpha=0.6, color='green')\n", "#plt.plot(rewards_total)\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After about 200 episodes, the agent looks well on its way to learning to maximize its reward." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at how the number of steps it took to finish varied by episode." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "plt.figure(3,figsize=[12,5])\n", "plt.title(\"Steps to finish episode\")\n", "plt.plot(steps_total)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After about 200 episodes, it looks like the agent found an optimal way to cross the frozen lake in 6 steps." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.7" } }, "nbformat": 4, "nbformat_minor": 2 }