AIToolbox
A library that offers tools for AI problem solving.
AIToolbox::Bandit::TopTwoThompsonSamplingPolicy Class Reference

This class implements the top-two Thompson sampling policy.
#include <AIToolbox/Bandit/Policies/TopTwoThompsonSamplingPolicy.hpp>
Public Member Functions

TopTwoThompsonSamplingPolicy(const Experience & exp, double beta)
    Basic constructor.

virtual size_t sampleAction() const override
    This function chooses an action using top-two Thompson sampling.

size_t recommendAction() const
    This function returns the most likely best action until this point.

virtual double getActionProbability(const size_t & a) const override
    This function returns the probability of taking the specified action.

virtual Vector getPolicy() const override
    This function returns a vector containing all probabilities of the policy.

const Experience & getExperience() const
    This function returns a reference to the underlying Experience we use.
Public Member Functions inherited from AIToolbox::PolicyInterface< void, void, size_t >

PolicyInterface(void s, size_t a)
    Basic constructor.

virtual ~PolicyInterface()
    Basic virtual destructor.

virtual size_t sampleAction(const void & s) const = 0
    This function chooses a random action for state s, following the policy distribution.

virtual double getActionProbability(const void & s, const size_t & a) const = 0
    This function returns the probability of taking the specified action in the specified state.

const void & getS() const
    This function returns the number of states of the world.

const size_t & getA() const
    This function returns the number of available actions to the agent.
Additional Inherited Members

Public Types inherited from AIToolbox::Bandit::PolicyInterface

using Base = AIToolbox::PolicyInterface< void, void, size_t >

Protected Attributes inherited from AIToolbox::PolicyInterface< void, void, size_t >

void S
size_t A
RandomEngine rand_
This class implements the top-two Thompson sampling policy.
This class uses the Student-t distribution to model normally-distributed rewards with unknown mean and variance. As more experience is gained, each distribution becomes a Normal which models the mean of its respective arm.
The top-two Thompson sampling policy is designed to be used in a pure exploration setting. In other words, we wish to discover the best arm in the shortest possible time, without needing to minimize regret while doing so. This last part is the key difference from many bandit algorithms, which try to exploit their knowledge more and more as time goes on.
The way this works is by focusing arm pulls on the currently estimated top two arms, since those are the most likely to contend for the title of best arm. The two top arms are estimated using Thompson sampling: we first sample a best action, and then, if needed, we keep sampling until a different best action is drawn. We then play the first sampled action with probability beta, and the other with probability 1-beta.
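The selection step can be sketched as follows. This is a simplified illustration rather than the library's code: it draws from Gaussian posteriors instead of the Student-t posteriors described above, and the names (sampleBestArm, topTwoSample) are made up for the example.

    #include <cstddef>
    #include <limits>
    #include <random>
    #include <vector>

    // Simplified stand-in for posterior sampling: draw one value per arm from a
    // Gaussian (the real policy uses Student-t posteriors) and return the arm
    // with the highest draw.
    size_t sampleBestArm(const std::vector<double> & means,
                         const std::vector<double> & stds, std::mt19937 & rng) {
        size_t best = 0;
        double bestValue = std::numeric_limits<double>::lowest();
        for (size_t a = 0; a < means.size(); ++a) {
            std::normal_distribution<double> dist(means[a], stds[a]);
            const double v = dist(rng);
            if (v > bestValue) { bestValue = v; best = a; }
        }
        return best;
    }

    // Top-two selection: play the first sampled best arm with probability beta,
    // otherwise keep re-sampling until a different best arm appears and play that.
    size_t topTwoSample(const std::vector<double> & means,
                        const std::vector<double> & stds,
                        double beta, std::mt19937 & rng) {
        const size_t first = sampleBestArm(means, stds, rng);

        std::bernoulli_distribution coin(beta);
        if (coin(rng)) return first;

        size_t second = first;
        while (second == first)
            second = sampleBestArm(means, stds, rng);
        return second;
    }

Re-sampling until a different best arm appears is what concentrates pulls on the two main contenders, which is the point of the top-two scheme in a best-arm identification setting.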
AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::TopTwoThompsonSamplingPolicy(const Experience & exp, double beta)
Basic constructor.

Parameters:
    exp     The Experience we learn from.
    beta    The probability of playing the first sampled best action instead of the second sampled best.
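A minimal construction sketch. The policy header path comes from this page; the Experience constructor and its record(action, reward) call are assumptions, so check the Experience documentation for the exact interface.

    #include <AIToolbox/Bandit/Experience.hpp>
    #include <AIToolbox/Bandit/Policies/TopTwoThompsonSamplingPolicy.hpp>

    int main() {
        // Assumption: Experience takes the number of arms and is updated via
        // record(action, reward); verify against its own documentation.
        AIToolbox::Bandit::Experience exp(3);
        exp.record(0, 1.0);
        exp.record(1, 0.5);
        exp.record(2, 0.2);

        // beta = 0.5: half of the pulls go to the first sampled best arm,
        // the other half to the second sampled best arm.
        AIToolbox::Bandit::TopTwoThompsonSamplingPolicy policy(exp, 0.5);

        const auto action = policy.sampleAction();
        (void)action;
    }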
virtual double AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::getActionProbability(const size_t & a) const override
This function returns the probability of taking the specified action.
WARNING: the only way to compute the true probability of selecting the input action is via empirical sampling: we simply call sampleAction() many times and return the fraction of calls in which the input action was actually selected. This makes this function very SLOW. Do not call it at will!
Parameters:
    a    The selected action.
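The warning amounts to an estimate of the following kind (a sketch; the function name and sample count are illustrative, not the library's internals):

    #include <AIToolbox/Bandit/Policies/TopTwoThompsonSamplingPolicy.hpp>
    #include <cstddef>

    // Estimate P(a) empirically by repeated sampling; this is why the call
    // is slow.
    double estimateActionProbability(
            const AIToolbox::Bandit::TopTwoThompsonSamplingPolicy & policy,
            std::size_t a, unsigned samples = 10000) {
        unsigned hits = 0;
        for (unsigned i = 0; i < samples; ++i)
            if (policy.sampleAction() == a)
                ++hits;
        return static_cast<double>(hits) / samples;
    }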
const Experience & AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::getExperience() const
This function returns a reference to the underlying Experience we use.
virtual Vector AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::getPolicy() const override
This function returns a vector containing all probabilities of the policy.
Ideally, this function should be called only when there is a repeated need to access the same policy values in an efficient manner.

WARNING: this can be really expensive, as it does essentially the same empirical-sampling work as getActionProbability(). It should not be slower than that call, though, so if you do need the overall policy, do call this method.
Implements AIToolbox::Bandit::PolicyInterface.
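A small usage sketch, assuming a policy object built elsewhere: fetch the whole policy once and reuse it, instead of calling getActionProbability() once per arm.

    #include <AIToolbox/Bandit/Policies/TopTwoThompsonSamplingPolicy.hpp>
    #include <iostream>

    // Print the full policy. `auto` avoids assuming the exact Vector typedef;
    // getPolicy() pays the sampling cost once for all arms.
    void printPolicy(const AIToolbox::Bandit::TopTwoThompsonSamplingPolicy & policy) {
        const auto p = policy.getPolicy();
        for (long a = 0; a < p.size(); ++a)
            std::cout << "P(a = " << a << ") = " << p[a] << '\n';
    }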
size_t AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::recommendAction() const
This function returns the most likely best action until this point.
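A pure-exploration sketch tying the pieces together: explore with sampleAction(), record rewards into the shared Experience, and finally ask for the recommended arm. The environment function pullArm and the Experience interface (constructor from number of arms, record(action, reward)) are assumptions for illustration.

    #include <AIToolbox/Bandit/Experience.hpp>
    #include <AIToolbox/Bandit/Policies/TopTwoThompsonSamplingPolicy.hpp>
    #include <cstddef>
    #include <random>

    // Hypothetical environment: arm 2 has the highest mean reward.
    double pullArm(std::size_t a, std::mt19937 & rng) {
        std::normal_distribution<double> dist(a == 2 ? 1.0 : 0.0, 1.0);
        return dist(rng);
    }

    std::size_t findBestArm(std::size_t A, unsigned budget) {
        std::mt19937 rng(42);

        // The policy keeps a reference to exp (see getExperience()), so data
        // recorded below is immediately reflected in its posteriors.
        AIToolbox::Bandit::Experience exp(A);
        AIToolbox::Bandit::TopTwoThompsonSamplingPolicy policy(exp, 0.5);

        for (unsigned t = 0; t < budget; ++t) {
            const auto a = policy.sampleAction();   // top-two exploration
            exp.record(a, pullArm(a, rng));         // assumed Experience API
        }

        // After the exploration budget is spent, return the arm currently
        // believed to be the best.
        return policy.recommendAction();
    }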
virtual size_t AIToolbox::Bandit::TopTwoThompsonSamplingPolicy::sampleAction() const override
This function chooses an action using top-two Thompson sampling.