1999 Volume 14 Issue 1 Pages 122-130
Much previous work in reinforcement learning (RL) is limited to Markov decision processes (MDPs). However, a great many real-world applications do not satisfy this assumption. Real-world RL tasks are characterized by two difficulties: function approximation and hidden state problems. For large or continuous state and action spaces, the agent has to incorporate some form of generalization. One way to do this is to use general function approximators to represent value functions or control policies. Hidden state problems, which can be represented by partially observable MDPs (POMDPs), arise when the RL agent cannot observe the state of the environment perfectly owing to noisy or insufficient sensors, partial information, etc. We present an RL algorithm for POMDPs that is based on stochastic gradient ascent. It uses a function approximator to represent a stochastic policy, and updates the policy parameters by stochastic gradient ascent. We apply the algorithm to a robot control problem and compare its behavior with Q-learning and Jaakkola's method. The results show that the algorithm is very robust even when the agent is restricted to very limited computational resources.
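As a rough illustration of the kind of method the abstract describes, the following is a minimal REINFORCE-style sketch of stochastic gradient ascent on the parameters of a stochastic policy conditioned only on the current (possibly aliased) observation. The linear softmax parameterization, the class and hyperparameter names, and the reward-weighted eligibility-trace update are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

class StochasticGradientPolicy:
    """Linear softmax policy over observation features, updated by
    stochastic gradient ascent with an eligibility trace.
    A generic sketch, not the authors' exact update rule."""

    def __init__(self, n_features, n_actions, alpha=0.01, gamma=0.95):
        self.theta = np.zeros((n_actions, n_features))  # policy parameters
        self.trace = np.zeros_like(self.theta)          # eligibility trace
        self.alpha = alpha                              # learning rate
        self.gamma = gamma                              # trace decay

    def act(self, obs):
        # obs: feature vector of the current observation (not the full state)
        prefs = self.theta @ obs
        prefs -= prefs.max()                            # numerical stability
        probs = np.exp(prefs) / np.exp(prefs).sum()
        action = np.random.choice(len(probs), p=probs)
        # gradient of log pi(action | obs) for a softmax policy
        grad = -np.outer(probs, obs)
        grad[action] += obs
        return action, grad

    def update(self, grad, reward):
        # accumulate the characteristic eligibility, then climb the
        # estimated gradient of the expected reward
        self.trace = self.gamma * self.trace + grad
        self.theta += self.alpha * reward * self.trace
```

Because such a policy is stochastic and conditioned only on the current observation, it can randomize over perceptually aliased observations, which is one reason gradient-based policy methods can remain robust in POMDPs where Q-learning with a deterministic greedy policy may fail.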