1999 Volume 14 Issue 1 Pages 148-156
A Partially Observable Markov Decision Process (POMDP) is a representative class of non-Markovian environment in which an agent senses different environmental states as the same sensory input. We recognize that any full treatment of POMDPs must overcome two deceptive problems. We call the confusion of state values the Type 1 deceptive problem, and the indistinguishability of rational and irrational rules the Type 2 deceptive problem. The Type 1 problem deceives Q-learning, the most widely used method, because it estimates state values. Although Profit Sharing, which satisfies the Rationality Theorem [Miyazaki 94], is not deceived by the Type 1 problem, it cannot overcome the Type 2 problem. Current approaches to POMDPs fall into two types. One is the memory-based approach, which uses histories of sensor-action pairs to distinguish partially observable states. The other uses a stochastic policy, in which the agent selects actions stochastically in order to escape from partially observable states. The memory-based approach requires a large amount of memory to store histories of sensor-action pairs, while a stochastic policy may generate unnecessary actions before a reward is acquired. In this paper, we propose a new approach to POMDPs. For the subclass of environments that do not require a stochastic policy, we consider learning a deterministic rational policy that avoids all states in which the Type 2 problem manifests. We claim that a weight used as the evaluation factor of a rule can derive an irrational policy because of the Type 2 problem; therefore, no weights are used to construct a rational policy. We propose the Rational Policy Making algorithm (RPM), which learns a rational policy by acquiring rational rules directly from their definition. RPM is applied to maze environments, and we show that it learns the most stable rational policy in comparison with other methods.
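The following is a minimal illustrative sketch, not taken from the paper, of the perceptual aliasing behind the Type 1 deceptive problem. The two-branch environment, the state and action names, and all parameters are assumptions chosen for illustration; the episodic scheme shown is only a simplified Monte-Carlo-style credit assignment in the spirit of Profit Sharing, not its exact reinforcement function, and RPM itself is not implemented here. Because the two intermediate states share one observation, observation-indexed Q-learning bootstraps both start actions from the same table entry and cannot tell them apart, whereas the non-bootstrapping episodic scheme reinforces only the branch that actually yields the reward.

```python
"""Hypothetical two-branch POMDP: 'good' and 'bad' are aliased to 'mid'."""
import random

OBS = {"start": "start", "good": "mid", "bad": "mid", "goal": "goal"}
ACTIONS = {"start": ["up", "down"], "mid": ["go"]}
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1


def step(state, action):
    """Deterministic dynamics; only the 'good' branch pays a reward."""
    if state == "start":
        return ("good" if action == "up" else "bad"), 0.0
    return "goal", (10.0 if state == "good" else 0.0)


def e_greedy(values, obs):
    acts = ACTIONS[obs]
    if random.random() < EPS:
        return random.choice(acts)
    return max(acts, key=lambda a: values[(obs, a)])


def q_learning(episodes=2000):
    """Observation-indexed Q-learning.  'good' and 'bad' share the entry
    ('mid', 'go'), so both start actions bootstrap from the same value and
    end up indistinguishable -- the flavour of the Type 1 problem."""
    q = {(o, a): 0.0 for o, acts in ACTIONS.items() for a in acts}
    for _ in range(episodes):
        s = "start"
        while s != "goal":
            o = OBS[s]
            a = e_greedy(q, o)
            s2, r = step(s, a)
            o2 = OBS[s2]
            boot = max((q[(o2, b)] for b in ACTIONS.get(o2, [])), default=0.0)
            q[(o, a)] += ALPHA * (r + GAMMA * boot - q[(o, a)])
            s = s2
    return q


def episodic_credit(episodes=2000, decay=0.3):
    """Simplified Profit-Sharing-like scheme: no bootstrapping; the final
    reward is shared backward along the episode, so ('start', 'up') is
    reinforced only on episodes that actually reach the reward."""
    w = {(o, a): 0.0 for o, acts in ACTIONS.items() for a in acts}
    for _ in range(episodes):
        s, trace, r = "start", [], 0.0
        while s != "goal":
            o = OBS[s]
            a = e_greedy(w, o)
            s, r = step(s, a)
            trace.append((o, a))
        credit = r
        for o, a in reversed(trace):
            w[(o, a)] += credit
            credit *= decay
    return w


if __name__ == "__main__":
    print("Q-learning :", {k: round(v, 2) for k, v in q_learning().items()})
    print("episodic   :", {k: round(v, 2) for k, v in episodic_credit().items()})
```

Running the sketch shows the two start actions converging to nearly equal Q-values (both bootstrap from the aliased 'mid' entry), while the episodic weights clearly favour ('start', 'up'); this is only meant to motivate why a bootstrapping value estimate is vulnerable to aliasing, not to reproduce the paper's maze experiments.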