1. Components
OpenAI Gym consists of two parts:
- The gym open-source library: a collection of test problems. When you test a reinforcement learning algorithm, the test problems are environments; for example, when a robot plays a game, the environment is made up of the game's screens. These environments share a common interface, which lets users design general-purpose algorithms (a short example follows this list).
- The OpenAI Gym service: a site (for example, for CartPole-v0: https://gym.openai.com/envs/CartPole-v0) and an API that let users compare their test results.
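Because every environment exposes the same interface, the same code can instantiate and inspect any registered environment. A minimal sketch (CartPole-v0 and MountainCar-v0 are standard Gym ids, used here purely for illustration):

import gym

# The same interface works for any registered environment.
for env_id in ['CartPole-v0', 'MountainCar-v0']:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)
    env.close()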
2. Interface
The core interface of gym is Env, which serves as the unified environment interface. Env provides the following core methods (a minimal interaction loop putting them together follows the list):
- reset(self): resets the environment's state and returns the initial observation.
- step(self, action): the physics engine; advances the environment by one time step and returns observation, reward, done, info.
- render(self, mode='human', close=False): the rendering engine; redraws one frame of the environment. The default mode is usually human-friendly, e.g., it pops up a window.
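Putting reset, step, and render together, a typical interaction loop looks like the following sketch, using the observation, reward, done, info convention described above (CartPole-v0 stands in for any environment, and the random action is for illustration only):

import gym

env = gym.make('CartPole-v0')
observation = env.reset()                 # reset the state and get the initial observation
for t in range(200):
    env.render()                          # redraw the current frame
    action = env.action_space.sample()    # random action, for illustration
    observation, reward, done, info = env.step(action)   # advance one time step
    if done:
        print('Episode finished after {} timesteps'.format(t + 1))
        break
env.close()

The loop ends either when step returns done=True or after 200 steps.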
3. Registering your own simulator
- The goal is to register your own environment in the registry. Suppose you have defined your environment with the following structure:
myenv/
    __init__.py
    myenv.py
- myenv.py contains the class for our own environment. In __init__.py, add the following code:
from gym.envs.registration import register

register(
    id='MyEnv-v0',
    # the first myenv is the package (folder) name, the second is the module (file) name,
    # and MyEnv is the class name inside that file
    entry_point='myenv.myenv:MyEnv',
)
- To use our own environment:
import gym
import myenv  # be sure to import your own environment package; this step is easy to forget

env = gym.make('MyEnv-v0')
- Make sure the myenv directory is on PYTHONPATH (install it as a package, or start python from its parent directory); a packaging sketch follows below.
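One common way to make the myenv package importable from anywhere (this setup.py layout is an assumption, not part of the original structure) is to place a setup.py next to the myenv/ directory and install it in editable mode with pip install -e .:

# setup.py, placed in the directory that contains myenv/ (hypothetical layout)
from setuptools import setup, find_packages

setup(
    name='myenv',
    version='0.0.1',
    packages=find_packages(),    # picks up the myenv package
    install_requires=['gym'],    # the environment code depends on gym
)

After installation, import myenv works from any directory, so gym.make('MyEnv-v0') can find the registered id.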
Below is a complete example. Directory structure:
myenv/
    __init__.py
    my_hotter_colder.py
-------------------
__init__.py file:
-------------------
from gym.envs.registration import register

register(
    id='MyHotterColder-v0',
    entry_point='myenv.my_hotter_colder:MyHotterColder',
)
-------------------
my_hotter_colder.py file:
-------------------
import gym
from gym import spaces
from gym.utils import seeding
import numpy as np
class MyHotterColder(gym.Env):
    """Hotter Colder.

    The goal of hotter colder is to guess as close as possible to a randomly
    selected number.

    After each step the agent receives an observation of:
    0 - No guess yet submitted (only after reset)
    1 - Guess is lower than the target
    2 - Guess is equal to the target
    3 - Guess is higher than the target

    The reward is calculated as:
    ((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2

    Ideally an agent will be able to recognise the 'scent' of a higher reward and
    increase the rate at which it guesses in that direction until the reward reaches
    its maximum.
    """

    def __init__(self):
        self.range = 1000   # +/- value the randomly selected number can be between
        self.bounds = 2000  # action space bounds
        # A guess is a single continuous value in [-bounds, bounds].
        self.action_space = spaces.Box(low=np.array([-self.bounds]), high=np.array([self.bounds]))
        # The observation is one of the four hints described in the docstring.
        self.observation_space = spaces.Discrete(4)

        self.number = 0
        self.guess_count = 0
        self.guess_max = 200
        self.observation = 0

        self.seed()
        self.reset()

    def seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)
        return [seed]

    def step(self, action):
        assert self.action_space.contains(action)

        if action < self.number:
            self.observation = 1
        elif action == self.number:
            self.observation = 2
        elif action > self.number:
            self.observation = 3

        # Reward grows as the guess approaches the target; squaring sharpens the gradient.
        reward = ((min(action, self.number) + self.bounds) / (max(action, self.number) + self.bounds)) ** 2

        self.guess_count += 1
        done = self.guess_count >= self.guess_max

        return self.observation, reward[0], done, {"number": self.number, "guesses": self.guess_count}

    def reset(self):
        self.number = self.np_random.uniform(-self.range, self.range)
        self.guess_count = 0
        self.observation = 0
        return self.observation
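With the two files above in place and the package importable, the new environment behaves like any built-in one. A minimal sketch with a random agent (the episode ends after guess_max = 200 guesses):

import gym
import myenv   # triggers the register() call in myenv/__init__.py

env = gym.make('MyHotterColder-v0')
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()    # a random guess in [-2000, 2000]
    observation, reward, done, info = env.step(action)
    total_reward += reward
print('target:', info['number'], 'guesses:', info['guesses'], 'total reward:', total_reward)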