AIToolbox
A library that offers tools for AI problem solving.
This class represents the mining bandit problem.
#include <AIToolbox/Factored/Bandit/Environments/MiningProblem.hpp>
Public Member Functions

MiningBandit (Action A, std::vector< unsigned > workersPerVillage, std::vector< double > productivityPerMine, bool normalizeToOne = true)
    Basic constructor.
const Rewards & sampleR (const Action & a) const
    This function samples the rewards for each mine from a set of Bernoulli distributions.
double getRegret (const Action & a) const
    This function computes the deterministic regret of the input joint action.
const Action & getOptimalAction () const
    This function returns the optimal action for this bandit.
const Action & getA () const
    This function returns the joint action space.
const std::vector< PartialKeys > & getGroups () const
    This function returns, for each mine, which villages are connected to it.
std::vector< QFunctionRule > getDeterministicRules () const
    This function returns a set of QFunctionRule for the bandit, ignoring stochasticity.
double getNormalizationConstant () const
    This function returns the normalization constant used.
This class represents the mining bandit problem.
This problem was introduced in the paper
"Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems"
by Bargiacchi et al.
There is a set of villages and a set of mines. Each village has a number of mine workers, and at each timestep it sends all of them to a single mine. Each timestep, every mine produces an amount of minerals proportional to its hidden productivity and to the number of workers sent to it.
Each village 'i' is always connected to the mines with indices from 'i' onwards. The last village is always connected to 4 mines.
Thus, the action of a given village 'i' is a number from 0 to N, where action 0 corresponds to sending all its workers to mine 'i', and action N to sending them to mine 'i' + N.
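For illustration only, the mapping from a village's local action to the mine that receives its workers could be written as follows (mineForAction is a hypothetical helper, not part of the library):

    #include <cstddef>

    // Hypothetical helper sketching the action-to-mine convention described above.
    // Action 0 sends the workers of village 'villageIndex' to mine 'villageIndex';
    // action N sends them to mine 'villageIndex' + N.
    size_t mineForAction(size_t villageIndex, size_t localAction) {
        return villageIndex + localAction;
    }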
The amount of minerals produced by each mine is computed deterministically from its hidden productivity and from the total number of workers sent to it.
Since these amounts are deterministic for each joint action, discovering the optimal action would be too easy. To generate a proper bandit, we convert these amounts into stochastic rewards through Bernoulli distributions.
First, we optionally normalize the outputs of each mine so that the maximum joint mineral amount that can be produced is 1. This is sometimes useful, as it results in pretty values for regrets and rewards (since the optimal action then has an expected reward of exactly 1).
In any case, each mine will be associated with a number between 0 and 1, which is used as the parameter of a Bernoulli distribution; the reward produced by that mine is then a sample from this distribution.
Note that this means that an action can randomly produce a reward higher than 1 (since multiple Bernoullis are sampled). However, the expected reward of the optimal action remains 1.
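The sketch below shows one way such a conversion could work, assuming the expected (already normalized) output of each mine for a fixed joint action is known; it is only an illustration of the idea, not the library's implementation, and the names expectedOutputPerMine and rng are made up for this example:

    #include <random>
    #include <vector>

    // Illustrative sketch: sample one Bernoulli reward per mine and sum them.
    // Each entry of expectedOutputPerMine must lie in [0, 1]; in expectation the
    // sum equals the total expected output, so the optimal action averages to 1
    // when normalization is enabled.
    double sampleJointReward(const std::vector<double> & expectedOutputPerMine, std::mt19937 & rng) {
        double total = 0.0;
        for (double p : expectedOutputPerMine) {
            std::bernoulli_distribution mineReward(p);
            total += mineReward(rng) ? 1.0 : 0.0;
        }
        return total;
    }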
AIToolbox::Factored::Bandit::MiningBandit::MiningBandit (Action A, std::vector< unsigned > workersPerVillage, std::vector< double > productivityPerMine, bool normalizeToOne = true)
Basic constructor.
Parameters
    A                    The action space. There is one action per village, which represents to which mine to send the workers.
    workersPerVillage    How many workers there are in each village.
    productivityPerMine  The productivity factor for each mine (between 0 and 1).
    normalizeToOne       Whether to normalize rewards so that the optimal action has expected reward 1.
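A minimal usage sketch follows. The sizes below (three villages, six mines) and the assumption that Action is a per-village list of action space sizes are illustrative; adjust them to your own instance.

    #include <AIToolbox/Factored/Bandit/Environments/MiningProblem.hpp>
    #include <vector>

    int main() {
        using namespace AIToolbox::Factored;

        // Assumed example instance: village 0 can reach 2 mines, village 1 can
        // reach 3, and the last village 4, giving 6 mines in total.
        Action A{2, 3, 4};
        std::vector<unsigned> workersPerVillage{5, 3, 4};
        std::vector<double> productivityPerMine{0.2, 0.6, 0.4, 0.9, 0.5, 0.3};

        // normalizeToOne defaults to true, so the optimal action has expected reward 1.
        Bandit::MiningBandit bandit(A, workersPerVillage, productivityPerMine);

        const Action & optimal = bandit.getOptimalAction();
        const auto & rewards = bandit.sampleR(optimal);  // one sampled reward per mine
        (void)rewards;
        return 0;
    }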
const Action & AIToolbox::Factored::Bandit::MiningBandit::getA () const
This function returns the joint action space.
std::vector< QFunctionRule > AIToolbox::Factored::Bandit::MiningBandit::getDeterministicRules () const
This function returns a set of QFunctionRule for the bandit, ignoring stochasticity.
This function is provided for testing maximization algorithms, to automatically generate rules for a given set of parameters.
The rules contain the true underlying rewards of the problem, ignoring the sampling stochasticity that is present in the sampleR() function.
In other words, finding the joint action that maximizes these rules is equivalent to finding the optimal action of the bandit.
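As an illustration of that equivalence, the sketch below scores a full joint action by summing the values of all matching rules. The field names rule.action (a pair of village indices and their local actions) and rule.value are assumptions about QFunctionRule's layout; check the library headers before relying on them.

    // Sketch under assumptions: sums the values of the deterministic rules that a
    // full joint action matches. Maximizing this quantity over all joint actions
    // is equivalent to finding the bandit's optimal action.
    double evaluateJointAction(const AIToolbox::Factored::Action & a,
                               const std::vector<AIToolbox::Factored::Bandit::QFunctionRule> & rules) {
        double total = 0.0;
        for (const auto & rule : rules) {
            const auto & villages = rule.action.first;   // assumed: indices of the involved villages
            const auto & locals   = rule.action.second;  // assumed: their local actions
            bool matches = true;
            for (size_t i = 0; i < villages.size(); ++i)
                if (a[villages[i]] != locals[i]) { matches = false; break; }
            if (matches) total += rule.value;            // assumed field name
        }
        return total;
    }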
const std::vector< PartialKeys > & AIToolbox::Factored::Bandit::MiningBandit::getGroups () const
This function returns, for each mine, which villages are connected to it.
This function returns, for each local reward function (a mine), the group of agents (villages) connected to it.
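For example, given a bandit constructed as in the usage sketch above, the groups could be printed as follows (assuming PartialKeys is an iterable list of village indices):

    #include <iostream>

    const auto & groups = bandit.getGroups();
    for (size_t mine = 0; mine < groups.size(); ++mine) {
        std::cout << "Mine " << mine << " receives workers from villages:";
        for (auto village : groups[mine]) std::cout << ' ' << village;
        std::cout << '\n';
    }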
double AIToolbox::Factored::Bandit::MiningBandit::getNormalizationConstant () const
This function returns the normalization constant used.
When normalization is enabled, this class ensures that the optimal action has an expected reward of 1. To do this, each local reward is divided by the maximum expected reward obtainable in the unnormalized problem.
This function returns that normalization constant.
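For instance, one could rescale sampled rewards back to the original, unnormalized mineral scale by multiplying by this constant (a sketch assuming Rewards behaves like an Eigen vector with a .sum() method):

    const double c = bandit.getNormalizationConstant();
    const auto & rewards = bandit.sampleR(bandit.getOptimalAction());
    // In expectation this recovers the unnormalized total mineral amount.
    std::cout << "Unnormalized total (one sample): " << rewards.sum() * c << '\n';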
const Action & AIToolbox::Factored::Bandit::MiningBandit::getOptimalAction () const
This function returns the optimal action for this bandit.
double AIToolbox::Factored::Bandit::MiningBandit::getRegret (const Action & a) const
This function computes the deterministic regret of the input joint action.
This function bypasses the Bernoulli distributions and directly computes the true regret of any given joint action.
Parameters
    a    The joint action of all villages.
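As an example, the regret of the optimal action is zero, while perturbing a single village's action yields the corresponding deterministic loss (a sketch reusing the bandit from the usage example above; indexing Action and getA() as vectors is an assumption):

    #include <iostream>

    const Action & best = bandit.getOptimalAction();
    std::cout << "Regret of optimal action: " << bandit.getRegret(best) << '\n';  // 0

    Action other = best;
    other[0] = (other[0] + 1) % bandit.getA()[0];  // send village 0's workers elsewhere
    std::cout << "Regret of perturbed action: " << bandit.getRegret(other) << '\n';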
const Rewards & AIToolbox::Factored::Bandit::MiningBandit::sampleR (const Action & a) const
This function samples the rewards for each mine from a set of Bernoulli distributions.
Parameters
    a    The joint action of all villages.
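As a usage sketch, averaging many samples of the optimal action should approach the expected reward of 1 when normalization is enabled (this again assumes Rewards supports .sum(), e.g. as an Eigen vector, and reuses the bandit and includes from the examples above):

    double total = 0.0;
    constexpr unsigned samples = 10000;
    for (unsigned i = 0; i < samples; ++i)
        total += bandit.sampleR(bandit.getOptimalAction()).sum();
    std::cout << "Estimated expected reward: " << total / samples << '\n';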