AIToolbox
A library that offers tools for AI problem solving.
AIToolbox::MDP::RLearning Class Reference

This class represents the RLearning algorithm. More...

#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

Public Member Functions

 RLearning (size_t S, size_t A, double alpha=0.1, double rho=0.1)
 Basic constructor. More...
 
template<IsGenerativeModel M>
 RLearning (const M &model, double alpha=0.1, double rho=0.1)
 Basic constructor. More...
 
void setAlphaLearningRate (double a)
 This function sets the learning rate parameter for the QFunction. More...
 
double getAlphaLearningRate () const
 This function will return the currently set alpha learning rate parameter. More...
 
void setRhoLearningRate (double r)
 This function sets the learning rate parameter for the average reward. More...
 
double getRhoLearningRate () const
 This function will return the currently set rho learning rate parameter. More...
 
void stepUpdateQ (size_t s, size_t a, size_t s1, double rew)
 This function updates the internal QFunction and the average reward using the learning rates set during construction. More...
 
size_t getS () const
 This function returns the number of states on which RLearning is working. More...
 
size_t getA () const
 This function returns the number of actions on which RLearning is working. More...
 
const QFunction & getQFunction () const
 This function returns a reference to the internal QFunction. More...
 
double getAverageReward () const
 This function returns the learned average reward. More...
 
void setQFunction (const QFunction &qfun)
 This function allows you to directly set the internal QFunction. More...
 

Detailed Description

This class represents the RLearning algorithm.

This algorithm is an analogue of QLearning for when one wishes to learn to maximize the average reward over infinitely long episodes, rather than the discounted reward. Such policies are called T-optimal policies.

Indeed, RLearning makes the point that discounting is an unnecessary and harmful abstraction in these cases, and that it is generally only used to bound the expected reward when acting over an infinite horizon. At the same time, discounting can result in policies which are unnecessarily greedy and do not maximize the average reward over time.

Thus, the update rule for the QFunction is slightly altered, so that, for each state-action pair, we learn the expected average-adjusted reward (present and future), i.e. the reward minus the average reward, which is the measure we want to learn to act upon. To do so, we also need to learn the average reward.
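As a rough sketch of this idea (this follows the standard R-Learning formulation and is for illustration only; it is not the library's internal code, and the variable names are made up), the average-adjusted update can be written as:

#include <cstddef>
#include <Eigen/Dense>

// Illustrative sketch of an R-Learning style update; stepUpdateQ() encapsulates
// the real logic. Q is an S x A value table, rAvg the running average reward.
void sketchUpdate(Eigen::MatrixXd & Q, double & rAvg,
                  size_t s, size_t a, size_t s1, double rew,
                  double alpha, double rho) {
    const double futureBest  = Q.row(s1).maxCoeff();
    const double currentBest = Q.row(s).maxCoeff();
    // Whether the performed action was greedy w.r.t. the current estimates
    // (a tolerance check may be preferable to exact equality).
    const bool wasGreedy = (Q(s, a) == currentBest);

    // Learn the average-adjusted value: reward minus the running average reward.
    Q(s, a) += alpha * (rew - rAvg + futureBest - Q(s, a));

    // The average reward itself is usually updated only on greedy steps.
    if (wasGreedy)
        rAvg += rho * (rew - rAvg + futureBest - currentBest);
}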

The two elements are learned side by side, and this is why here we have two separate learning rates; one for the QFunction and the other for the average reward. Note that the original paper calls these respectively the beta and alpha learning rate. Here, to keep consistency between methods, we call these alpha and rho. We also rename the standard setLearningRate() function to make sure that users understand what they are setting.

See also
setAlphaLearningRate(double)
setRhoLearningRate(double)

This algorithm does not actually need to sample from the input model, so it can be a good algorithm to apply in real-world scenarios, where there would be no way to reproduce the world's behavior aside from actually trying out actions. However, it does need to know the size of the state space and the size of the action space of the problem.
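A minimal usage sketch under these assumptions (the environment step function below is hypothetical and stands in for the real system; in practice an exploration strategy built on top of the learned QFunction would replace the uniform random action choice):

#include <cstddef>
#include <iostream>
#include <random>
#include <utility>
#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

// Hypothetical environment: given (state, action), returns (next state, reward).
std::pair<size_t, double> envStep(size_t s, size_t a) {
    return { (s + a + 1) % 10, s == 9 ? 1.0 : 0.0 };   // dummy dynamics for illustration
}

int main() {
    const size_t S = 10, A = 4;                        // known space sizes
    AIToolbox::MDP::RLearning solver(S, A, 0.1, 0.05);

    std::mt19937 rng(0);
    std::uniform_int_distribution<size_t> pickAction(0, A - 1);

    size_t s = 0;
    for (unsigned step = 0; step < 10000; ++step) {
        const size_t a = pickAction(rng);              // placeholder exploration
        const auto [s1, rew] = envStep(s, a);          // act on the real system
        solver.stepUpdateQ(s, a, s1, rew);             // learn from the single experience
        s = s1;
    }
    std::cout << "Learned average reward: " << solver.getAverageReward() << '\n';
}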

Constructor & Destructor Documentation

◆ RLearning() [1/2]

AIToolbox::MDP::RLearning::RLearning (size_t S, size_t A, double alpha = 0.1, double rho = 0.1)

Basic constructor.

Both learning rates must be > 0.0 and <= 1.0, otherwise the constructor will throw an std::invalid_argument.

Parameters
S	The size of the state space.
A	The size of the action space.
alpha	The learning rate for the QFunction.
rho	The learning rate for the average reward.
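For example (a minimal sketch; the sizes and learning rates below are arbitrary):

#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

int main() {
    // 20 states, 5 actions, alpha = 0.2 for the QFunction, rho = 0.05 for the average reward.
    AIToolbox::MDP::RLearning solver(20, 5, 0.2, 0.05);
}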

◆ RLearning() [2/2]

template<IsGenerativeModel M>
AIToolbox::MDP::RLearning::RLearning (const M & model, double alpha = 0.1, double rho = 0.1)

Basic constructor.

Both learning rates must be > 0.0 and <= 1.0, otherwise the constructor will throw an std::invalid_argument.

This constructor copies the S and A parameters from the supplied model. It does not keep a reference to the model afterwards.

Parameters
model	The MDP model that RLearning will use as a base.
alpha	The learning rate for the QFunction.
rho	The learning rate for the average reward.
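A sketch of this constructor, assuming AIToolbox::MDP::Model (with its default constructor arguments) as the generative model; any type satisfying IsGenerativeModel should work the same way:

#include <AIToolbox/MDP/Model.hpp>
#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

int main() {
    AIToolbox::MDP::Model model(20, 5);            // S = 20, A = 5
    // S and A are read from the model; the learning rates are passed explicitly.
    AIToolbox::MDP::RLearning solver(model, 0.1, 0.05);
}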

Member Function Documentation

◆ getA()

size_t AIToolbox::MDP::RLearning::getA ( ) const

This function returns the number of actions on which RLearning is working.

Returns
The number of actions.

◆ getAlphaLearningRate()

double AIToolbox::MDP::RLearning::getAlphaLearningRate ( ) const

This function will return the currently set alpha learning rate parameter.

Returns
The currently set alpha learning rate parameter.

◆ getAverageReward()

double AIToolbox::MDP::RLearning::getAverageReward ( ) const

This function returns the learned average reward.

Returns
The learned average reward.

◆ getQFunction()

const QFunction& AIToolbox::MDP::RLearning::getQFunction ( ) const

This function returns a reference to the internal QFunction.

The returned reference can be used to build Policies, for example MDP::QGreedyPolicy.

Returns
The internal QFunction.
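For instance (a brief sketch; the QGreedyPolicy header path is assumed here):

#include <cstddef>
#include <AIToolbox/MDP/Algorithms/RLearning.hpp>
#include <AIToolbox/MDP/Policies/QGreedyPolicy.hpp>

int main() {
    AIToolbox::MDP::RLearning solver(20, 5);
    // The policy keeps a reference to the QFunction, so it always acts on the latest estimates.
    AIToolbox::MDP::QGreedyPolicy policy(solver.getQFunction());
    const size_t action = policy.sampleAction(0);   // greedy action for state 0
    (void)action;
}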

◆ getRhoLearningRate()

double AIToolbox::MDP::RLearning::getRhoLearningRate ( ) const

This function will return the currently set rho learning rate parameter.

Returns
The currently set rho learning rate parameter.

◆ getS()

size_t AIToolbox::MDP::RLearning::getS ( ) const

This function returns the number of states on which RLearning is working.

Returns
The number of states.

◆ setAlphaLearningRate()

void AIToolbox::MDP::RLearning::setAlphaLearningRate ( double  a)

This function sets the learning rate parameter for the QFunction.

The learning parameter determines the speed at which the QFunction is modified with respect to new data. In fully deterministic environments (such as an agent moving through a grid, for example), this parameter can be safely set to 1.0 for maximum learning.

On the other hand, in stochastic environments this parameter should start relatively high and decrease slowly over time in order for the estimates to converge.

Otherwise, it can be kept somewhat high if the environment dynamics change progressively, so that the algorithm adapts accordingly. The final behavior of RLearning is very dependent on this parameter.

The learning rate parameter must be > 0.0 and <= 1.0, otherwise the function will throw an std::invalid_argument.

Parameters
a	The new alpha learning rate parameter.
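A hedged sketch of a simple decay schedule (the schedule itself is illustrative, not something the library prescribes):

#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

int main() {
    AIToolbox::MDP::RLearning solver(20, 5, 1.0, 0.05);
    for (unsigned episode = 0; episode < 1000; ++episode) {
        // Start with a high alpha and decrease it slowly; values stay within (0.0, 1.0].
        solver.setAlphaLearningRate(1.0 / (1.0 + 0.01 * episode));
        // ... run the episode here, calling stepUpdateQ() on every transition ...
    }
}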

◆ setQFunction()

void AIToolbox::MDP::RLearning::setQFunction (const QFunction & qfun)

This function allows you to directly set the internal QFunction.

This can be useful in order to use a QFunction that has already been computed elsewhere. RLearning will then continue building upon it.

This is used for example in the Dyna2 algorithm.

Parameters
qfun	The new QFunction to set.
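A minimal sketch (this assumes that QFunction is an Eigen-style matrix sized S x A, and the optimistic initial value below is arbitrary):

#include <cstddef>
#include <AIToolbox/MDP/Algorithms/RLearning.hpp>

int main() {
    const size_t S = 20, A = 5;
    AIToolbox::MDP::RLearning solver(S, A);

    // Start from an optimistically initialized QFunction computed or chosen elsewhere.
    AIToolbox::MDP::QFunction qfun(S, A);
    qfun.setConstant(10.0);
    solver.setQFunction(qfun);
}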

◆ setRhoLearningRate()

void AIToolbox::MDP::RLearning::setRhoLearningRate ( double  r)

This function sets the learning rate parameter for the average reward.

The learning parameter determines the speed at which the average reward is modified with respect to new data.

The learning rate parameter must be > 0.0 and <= 1.0, otherwise the function will throw an std::invalid_argument.

Parameters
r	The new rho learning rate parameter.

◆ stepUpdateQ()

void AIToolbox::MDP::RLearning::stepUpdateQ (size_t s, size_t a, size_t s1, double rew)

This function updates the internal QFunction and the average reward estimate using the learning rates set during construction.

This function takes a single experience point and uses it to update the QFunction. This is a very efficient method to keep the QFunction up to date with the latest experience.

Parameters
s	The previous state.
a	The action performed.
s1	The new state.
rew	The reward obtained.

The documentation for this class was generated from the following file:
AIToolbox/MDP/Algorithms/RLearning.hpp