AIToolbox
A library that offers tools for AI problem solving.
This tutorial assumes you have read and understood the MDP tutorial.
This tutorial's code can be found in the examples/POMDP/tiger_door.cpp
file, including comments and additional notes.
Another good online tutorial on POMDPs can be found here.
A Partially Observable Markov Decision Process, or POMDP for short, is a generalization of the ideas of an MDP to environments where the agent does not directly know the current state of the underlying environment. It simply has no access to it.
Instead, after performing each action, the agent receives an observation which helps it restrict the range of the possible current states. The observation is obtained from the environment following an observation function, which dictates the probability of obtaining a certain observation given a certain state and action.
Indeed, a POMDP is defined as a tuple <S, A, O, T, R, W, d>, where:

- S is the set of states of the environment;
- A is the set of actions available to the agent;
- O is the set of possible observations;
- T is the transition function;
- R is the reward function;
- W is the observation function, which gives the probability of each observation given the state landed in and the action performed;
- d is the discount factor.
The agent thus, at any timestep, maintains a belief of which state the environment is actually in: some states will be more likely, and some less. The belief is a discrete probability distribution over all states, indicating which states the agent considers likely.
In POMDP planning this belief can be maintained over time, as the agent knows the POMDP dynamics. Thus, given its previous belief and knowing which action was performed and what observation was received, the agent can compute a new belief, with probabilities for every possible state of the environment.
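The belief update described above can be written out explicitly. The following is a minimal, self-contained sketch for a 2-state problem (it is not AIToolbox's API; the function name and fixed sizes are our own choices for illustration). It implements the standard rule b'(s') ∝ W(s', a, o) * Σ_s T(s, a, s') * b(s), followed by a renormalization:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t S = 2; // number of states (illustrative)

// Compute the new belief after performing an action and receiving an
// observation. T is sliced for the chosen action (T[s][s']), and W is
// sliced for the chosen action and received observation (W[s']).
std::array<double, S> updateBelief(
        const std::array<double, S> & b,
        const std::array<std::array<double, S>, S> & T,
        const std::array<double, S> & W)
{
    std::array<double, S> bNew{};
    double norm = 0.0;
    for (std::size_t s1 = 0; s1 < S; ++s1) {
        // Probability of landing in s1, summed over all starting states.
        double p = 0.0;
        for (std::size_t s = 0; s < S; ++s)
            p += T[s][s1] * b[s];
        // Weight by the probability of the observation we actually got.
        bNew[s1] = W[s1] * p;
        norm += bNew[s1];
    }
    // Renormalize so the belief is again a probability distribution.
    for (auto & v : bNew) v /= norm;
    return bNew;
}
```

For example, starting from a uniform belief and listening (an identity transition) with an 85%-accurate observation, the belief shifts from {0.5, 0.5} to {0.85, 0.15}; a second consistent observation concentrates it further.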
Given that this is a complicated process, POMDP policies are usually in the shape of a tree. Depending on which action was performed, and which observation was received, the agent descends the appropriate tree branch and sees what action it should perform next. Thus, in POMDPs the horizon (the number of timesteps that you plan to act in the environment) is very important, as it significantly affects which actions you should take.
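To make the tree shape concrete, here is a minimal sketch of what such a policy looks like as a data structure (this is illustrative only, not AIToolbox's actual policy classes): each node prescribes an action, and each received observation selects which child subtree to descend into next.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One node of a finite-horizon tree policy: the action to perform now,
// and one child subtree per possible observation.
struct PolicyNode {
    std::size_t action;
    std::vector<PolicyNode> children; // indexed by observation
};

// Follow the tree along a sequence of received observations, collecting
// the actions the policy prescribes at each step.
std::vector<std::size_t> runPolicy(const PolicyNode & root,
                                   const std::vector<std::size_t> & observations) {
    std::vector<std::size_t> actions{root.action};
    const PolicyNode * node = &root;
    for (auto o : observations) {
        node = &node->children[o];
        actions.push_back(node->action);
    }
    return actions;
}
```

Note how the size of such a tree grows with the horizon, which is one reason the horizon matters so much in POMDP planning.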
Keep in mind that POMDPs are much harder to solve than MDPs, so it's important to keep the size of the state, action, and observation spaces as small as possible.
Let's try to create a POMDP to show how it would work out in practice. Our problem is going to be the following:
Suppose you are in front of two doors. Behind one is a treasure. Behind the other a man-eating tiger. You don't know in advance where each is, but you want to try to get to the treasure safely. Also, after picking, the doors are reset, so you can try again.
You can try to listen for the breathing of the tiger. The world around you is somewhat noisy, so you may sometimes think you hear something coming from the wrong door.
How much time should you spend listening for the tiger, before trying your luck and opening a door?
Let's think about how to encode this world into a POMDP. We will go through each component of the POMDP and try to fill it out.
There are only two possibilities: either the door on the left holds the treasure (and the tiger is behind the right), or vice-versa. This will be simple!
Only three actions: open the door on the left, open the door on the right, or wait and listen for the breathing tiger.
Finally, the observations: there are also only two. When we listen, we get an observation on whether we heard the tiger behind the left door or the right one.
Keep in mind that such an observation does not mirror the truth! We may have misheard, and it's this uncertainty that makes this problem a POMDP (rather than an MDP).
The transition function in this problem is also quite simple. If we are listening, nothing changes in the world. When we instead open a door, we discover what's behind it, and the problem is reset.
We can directly write a 3-dimensional transition matrix, as this should be relatively simple.
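One way this matrix could look is sketched below (the index conventions are our own choice for illustration: state 0 means the tiger is behind the left door, state 1 behind the right; actions are 0 = open left, 1 = open right, 2 = listen):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t S = 2, A = 3;

// T[s][a][s']: probability of moving from state s to s' after action a.
// Globals are zero-initialized, so we only set the nonzero entries.
double T[S][A][S];

void makeTransitions() {
    for (std::size_t s = 0; s < S; ++s) {
        // Opening either door resets the problem: the tiger ends up
        // behind a uniformly random door.
        T[s][0][0] = 0.5; T[s][0][1] = 0.5; // open left
        T[s][1][0] = 0.5; T[s][1][1] = 0.5; // open right
        // Listening does not change the world.
        T[s][2][s] = 1.0;
    }
}
```

Each row T[s][a] sums to one, as it must for a valid transition function.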
The reward function is similar. We want to give a small penalty for listening, so that the agent won't try that forever. We'll give a decent reward for obtaining a treasure, and a great penalty for opening the door to the tiger.
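A sketch of such a reward matrix is below. The specific numbers (-1 for listening, +10 for the treasure, -100 for the tiger) are the values commonly used in the tiger-problem literature, and are an assumption here; any rewards with the same ordering of outcomes would encode the same idea. Index conventions are as before (state 0: tiger behind the left door; actions 0 = open left, 1 = open right, 2 = listen):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t S = 2, A = 3;

// R[s][a]: reward for performing action a in state s.
const double R[S][A] = {
    //  open left   open right   listen
    {     -100.0,       10.0,      -1.0 },  // tiger behind left door
    {       10.0,     -100.0,      -1.0 },  // tiger behind right door
};
```

The large penalty for the tiger relative to the small listening cost is what makes it worthwhile for the agent to spend a few timesteps gathering information before committing to a door.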
Finally, the observation function. We want to give the agent an 85% chance of hearing the tiger correctly. So most of the time the observation will mirror the truth, but not always.
We also have to give the agent an observation when it opens a door; in that case we return an observation uniformly at random (each with probability 1.0 / O), so as not to give any information to the agent during a reset.
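A sketch of this observation matrix follows, using the same illustrative index conventions as before (observation 0 = heard the tiger behind the left door, 1 = behind the right door):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t S = 2, A = 3, O = 2;

// W[s'][a][o]: probability of receiving observation o, given the state
// s' landed in and the action a performed.
double W[S][A][O];

void makeObservations() {
    // Listening: 85% chance of hearing the tiger behind the correct door.
    W[0][2][0] = 0.85; W[0][2][1] = 0.15; // tiger behind left door
    W[1][2][0] = 0.15; W[1][2][1] = 0.85; // tiger behind right door
    // Opening a door (actions 0 and 1): uniform observation (1.0 / O),
    // which carries no information during the reset.
    for (std::size_t s = 0; s < S; ++s)
        for (std::size_t a = 0; a < 2; ++a)
            for (std::size_t o = 0; o < O; ++o)
                W[s][a][o] = 1.0 / O;
}
```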
We finally use a discount of 0.95, as is usually done for this problem. Its meaning has not changed from the MDP case.
Since, differently from the MDP tutorial, we have not written functions for T, R and W, but have instead directly created matrices, we can simply copy them into a new AIToolbox::POMDP::Model to start solving.
The default POMDP model is an extension of a given MDP (since, as we said, a POMDP is a generalization of an MDP; the basic environment is still there, it's just that the agent cannot see it directly). In our case, we inherit from AIToolbox::MDP::Model to keep things simple.
This POMDP is simple enough that we can solve it exactly. We use the AIToolbox::POMDP::IncrementalPruning algorithm, which is one of the most performant exact solvers available.
Here we also show some code that explains how to use the value function from the solution to obtain a policy, and how to use it. As we mentioned, policies are a bit more complex in POMDPs due to their tree shape.
Please check out the documentation of AIToolbox::POMDP::Policy to better learn how to use it.
The full code of this example can be found in the examples/POMDP/tiger_door.cpp file, and is built automatically by adding -DMAKE_EXAMPLES=1 when running CMake.
This tutorial has given you a very brief introduction to the world of POMDPs. Given that they are so hard to solve, a lot of research has been done on approximate solvers: point-based solvers in particular, where the POMDP is solved only for a certain number of possible beliefs, have seen great success both in theory and in practical applications. AIToolbox implements some of them, for example AIToolbox::POMDP::PBVI, AIToolbox::POMDP::PERSEUS, or even the bound-based AIToolbox::POMDP::GapMin.
Remember to read each class' documentation, as it explains the ideas behind each algorithm and how to use it, and feel free to check out external references to learn more about POMDPs.