Learning Algorithms

This module contains implementations of various reinforcement learning algorithms.

The core API of each learning algorithm is .get_elementwise_objective, which returns the per-tick loss that you can minimize over the network weights using e.g. lasagne.updates.your_favorite_method.
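
For instance, a minimal training-loop sketch (assuming elwise_loss came from some .get_elementwise_objective call and agent_layers are your agent's Lasagne output layers; both names are illustrative):

    import theano
    import lasagne

    # reduce the per-tick loss to a scalar: sum over time, mean over the batch
    loss = elwise_loss.sum(axis=1).mean()

    # gather trainable weights and build updates with your favorite lasagne optimizer
    weights = lasagne.layers.get_all_params(agent_layers, trainable=True)
    updates = lasagne.updates.adam(loss, weights, learning_rate=1e-4)

    # compile the training step; add inputs/givens for your session data as needed
    train_step = theano.function([], loss, updates=updates)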

Q-learning

Q-learning implementation. Works with discrete action spaces. Supports n-step updates and custom state value functions (max over Q(s,a), double Q-learning, Boltzmann, mellowmax, expected value SARSA, ...).

agentnet.learning.qlearning.get_elementwise_objective(qvalues, actions, rewards, is_alive='always', qvalues_target=None, state_values_target=None, n_steps=1, gamma_or_gammas=0.99, crop_last=True, state_values_target_after_end='zeros', consider_reference_constant=True, aggregation_function='deprecated', force_end_at_last_tick=False, return_reference=False, loss_function=<function squared_error>)[source]

Returns the squared error between predicted and reference Q-values according to the n-step Q-learning algorithm:

Qreference(state,action) = reward(state,action) + gamma*reward(state_1,action_1) + ... + gamma^n * max[action_n] Q(state_n,action_n)
loss = mean over (Qvalues - Qreference)**2
Parameters:
  • qvalues – [batch,tick,actions] - predicted qvalues
  • actions – [batch,tick] - committed actions
  • rewards – [batch,tick] - immediate rewards for taking actions at given time ticks
  • is_alive – [batch,tick] - whether given session is still active at given tick. Defaults to always active.
  • qvalues_target – Q-values used when computing the reference (e.g. r + gamma*Q(s’,a_max)), shape [batch,tick,actions]. Examples: if None (default), the current Q-values are used; an older snapshot of Q-values (e.g. from a target network).
  • state_values_target – state values V(s) used when computing the reference (e.g. r + gamma*V(s’)), shape [batch_size,seq_length,1]. Examples: double Q-learning, V(s) = Q_old(s, argmax Q_new(s,a)); expected value SARSA, V(s) = E_{a~pi(a|s)} Q(s,a); state values from a teacher network (knowledge transfer).

Provide either qvalues_target or state_values_target (or neither), but not both at once.

Parameters:
  • n_steps – if an integer is given, uses the n-step Q-learning algorithm. If 1 (default), this works exactly as normal Q-learning. If None, rewards are propagated throughout the whole sequence of state-action pairs.
  • gamma_or_gammas – delayed reward discounts: a single value or array[batch,tick] (can broadcast dimensions).
  • crop_last – if True, zeroes out the loss at the final tick; if False, computes the loss against Qvalues_after_end.
  • state_values_target_after_end – [batch,1] - symbolic expression for the “next best Q-values” after the last tick, used only when computing reference Q-values. Defaults to T.zeros_like(Q-values[:,0,None,0]). If crop_last=True, the last tick is simply not penalized. If you wish to ignore the last tick, use the defaults and crop the output’s last tick (qref[:,:-1]).
  • consider_reference_constant – whether or not to zero out gradient flow through reference_qvalues (True is highly recommended)
  • force_end_at_last_tick – if True, forces session end at the last tick unless it ended otherwise
  • return_reference – if True, returns reference Qvalues. If False, returns squared_error(action_qvalues, reference_qvalues)
  • loss_function – loss_function(V_reference,V_predicted). Defaults to (V_reference-V_predicted)**2. Use to override squared error with different loss (e.g. Huber or MAE)
Returns:

elementwise squared error over Q-values (using the formula above for the loss)
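
For example, a hedged sketch of building the training objective (the *_seq tensors are illustrative placeholders for symbolic session data; old_qvalues_seq stands for Q-values recorded with a target network):

    from agentnet.learning import qlearning

    # elementwise [batch,tick] squared error between Q(s,a) and the n-step reference
    elwise_loss = qlearning.get_elementwise_objective(
        qvalues=qvalues_seq,             # [batch,tick,actions]
        actions=actions_seq,             # [batch,tick]
        rewards=rewards_seq,             # [batch,tick]
        is_alive=is_alive_seq,           # [batch,tick]
        qvalues_target=old_qvalues_seq,  # e.g. an older snapshot (target network)
        n_steps=3,
        gamma_or_gammas=0.99)

    loss = elwise_loss.sum(axis=1).mean()  # scalar objective for the optimizer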

SARSA

State-Action-Reward-State-Action (sars’a’) learning algorithm implementation. Supports n-step eligibility traces. This is on-policy SARSA. To get off-policy Expected Value SARSA, use agentnet.learning.qlearning with a custom state_values_target (e.g. V(s) = E_{a~pi(a|s)} Q(s,a)).

agentnet.learning.sarsa.get_elementwise_objective(qvalues, actions, rewards, is_alive='always', qvalues_target=None, n_steps=1, gamma_or_gammas=0.99, crop_last=True, state_values_target_after_end='zeros', consider_reference_constant=True, force_end_at_last_tick=False, return_reference=False, loss_function=<function squared_error>)[source]

Returns the squared error between predicted and reference Q-values according to the n-step SARSA algorithm:

Qreference(state,action) = reward(state,action) + gamma*reward(state_1,action_1) + ... + gamma^n * Q(state_n,action_n)
loss = mean over (Qvalues - Qreference)**2

Parameters:
  • qvalues – [batch,tick,action_id] - predicted qvalues
  • actions – [batch,tick] - committed actions
  • rewards – [batch,tick] - immediate rewards for taking actions at given time ticks
  • is_alive – [batch,tick] - whether given session is still active at given tick. Defaults to always active.
  • qvalues_target – Q-values [batch,time,actions] or state values V(s) [batch_size,seq_length,1] used for the reference. Examples: if None (default), the current Q-values are used; an older snapshot of Q-values (e.g. from a target network); double Q-learning, V(s) = Q_old(s, argmax Q_new(s,a))[:,:,None]; state values from a teacher network (knowledge transfer).
  • n_steps – if an integer is given, uses the n-step SARSA algorithm. If 1 (default), this works exactly as normal SARSA. If None, rewards are propagated throughout the whole sequence of state-action pairs.
  • gamma_or_gammas – delayed reward discounts: a single value or array[batch,tick] (can broadcast dimensions).
  • crop_last – if True, zeroes out the loss at the final tick; if False, computes the loss against Qvalues_after_end.
  • state_values_target_after_end – [batch,1] - symbolic expression for the “best next state Q-values” after the last tick, used only when computing reference Q-values. Defaults to T.zeros_like(Q-values[:,0,None,0]). If you wish to simply ignore the last tick, use the defaults and crop the output’s last tick (qref[:,:-1]).
  • consider_reference_constant – whether or not to zero out gradient flow through reference_qvalues (True is highly recommended)
  • force_end_at_last_tick – if True, forces session end at the last tick unless it ended otherwise
  • loss_function – loss_function(V_reference,V_predicted). Defaults to (V_reference-V_predicted)**2. Use to override squared error with different loss (e.g. Huber or MAE)
  • return_reference – if True, returns reference Qvalues. If False, returns squared_error(action_qvalues, reference_qvalues)
Returns:

elementwise loss [squared error by default] over Q-values (using the formula above)
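
A minimal usage sketch (the *_seq tensor names are illustrative placeholders for symbolic session data, same conventions as in the Q-learning example above):

    from agentnet.learning import sarsa

    # on-policy n-step SARSA: the reference uses the Q-value of the action actually taken
    elwise_loss = sarsa.get_elementwise_objective(
        qvalues=qvalues_seq,    # [batch,tick,action_id]
        actions=actions_seq,    # [batch,tick]
        rewards=rewards_seq,    # [batch,tick]
        is_alive=is_alive_seq,  # [batch,tick]
        n_steps=1,              # plain SARSA; increase for n-step returns
        gamma_or_gammas=0.99)

    loss = elwise_loss.sum(axis=1).mean()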

Advantage actor-critic

Advantage Actor-Critic (A2C or A3C) implementation. Follows the article http://arxiv.org/pdf/1602.01783v1.pdf. Supports k-step advantage estimation as in https://arxiv.org/pdf/1506.02438v5.pdf.

The agent should output action probabilities and state values instead of Q-values. Works with discrete action spaces only.

agentnet.learning.a2c.get_elementwise_objective(policy, state_values, actions, rewards, is_alive='always', state_values_target=None, n_steps=1, n_steps_advantage='same', gamma_or_gammas=0.99, crop_last=True, state_values_target_after_end='zeros', state_values_after_end='zeros', consider_value_reference_constant=True, force_end_at_last_tick=False, return_separate=False, treat_policy_as_logpolicy=False, loss_function=<function squared_error>)[source]
Returns a cross-entropy-like objective function for the Actor-Critic method:

L_policy = - log(policy) * A(s,a)
L_V = (V - Vreference)^2

where A(s,a) is an advantage term (e.g. [r + gamma*V(s’) - V(s)]) and Vreference is the reference state value as per Temporal Difference.
Parameters:
  • policy – [batch,tick,action_id] or [batch,tick] - predicted probabilities for all actions (3-dim) or chosen actions (2-dim).
  • state_values – [batch,tick] - predicted state values
  • actions – [batch,tick] - committed actions
  • rewards – [batch,tick] - immediate rewards for taking actions at given time ticks
  • is_alive – [batch,tick] - binary matrix whether given session is active at given tick. Defaults to all ones.
  • state_values_target – state values used to compute the reference (e.g. from an older network snapshot). If None (default), the current state values are used to compute the reference.
  • n_steps – if an integer is given, the state value references are computed in loops of n_steps states. If 1 (default), this uses one-step TD, i.e. reference_V(s) = r + gamma*V(s’). If None, rewards are propagated throughout the whole session and only V(s_last) is taken at the session end.
  • n_steps_advantage – same as n_steps, but for the advantage term A(s,a) (see above). Defaults to the same value as n_steps.
  • gamma_or_gammas – a single value or array[batch,tick](can broadcast dimensions) of discount for delayed reward
  • crop_last – if True, zeroes out the loss at the final tick; if False, computes the last-tick loss using the after-end values.
  • force_values_after_end – if True, sets the reference values at the session end to rewards[end] + qvalues_after_end.
  • state_values_target_after_end – [batch,1] - “next target state values” after the last tick; used for the reference. Defaults to T.zeros_like(state_values_target[:,0,None,:]).
  • state_values_after_end – [batch,1] - “next state values” after the last tick; used for the reference. Defaults to T.zeros_like(state_values[:,0,None,:]).
  • consider_value_reference_constant – whether or not to zero out critic gradients flowing through the reference state values term
  • return_separate – if True, returns a tuple of (actor loss , critic loss ) instead of their sum.
  • treat_policy_as_logpolicy – if True, policy is used as log(pi(a|s)). You may want to do this for numerical stability reasons.
  • loss_function – loss_function(V_reference,V_predicted) used for CRITIC. Defaults to (V_reference-V_predicted)**2 Use to override squared error with different loss (e.g. Huber or MAE)
  • force_end_at_last_tick – if True, forces session end at last tick unless ended otherwise
Returns:

elementwise sum of policy_loss + state_value_loss [batch,tick]
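
A hedged sketch of obtaining the combined loss (illustrative names; policy_seq, V_seq and the other *_seq tensors stand for symbolic session data):

    from agentnet.learning import a2c

    # elementwise actor + critic loss, [batch,tick]
    elwise_loss = a2c.get_elementwise_objective(
        policy=policy_seq,      # [batch,tick,action_id] action probabilities
        state_values=V_seq,     # [batch,tick] predicted V(s)
        actions=actions_seq,    # [batch,tick]
        rewards=rewards_seq,    # [batch,tick]
        is_alive=is_alive_seq,  # [batch,tick]
        n_steps=5,              # 5-step returns for the critic and the advantage
        gamma_or_gammas=0.99)

    loss = elwise_loss.sum(axis=1).mean()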

Deterministic policy gradient

Deterministic policy gradient loss, also used for model-based acceleration algorithms. Supports regular and k-step implementations. Based on:
  • http://arxiv.org/abs/1509.02971
  • http://arxiv.org/abs/1603.00748
  • http://jmlr.org/proceedings/papers/v32/silver14.pdf

agentnet.learning.dpg.get_elementwise_objective_critic(action_qvalues, state_values, rewards, is_alive='always', n_steps=1, gamma_or_gammas=0.99, crop_last=True, state_values_after_end='zeros', force_end_at_last_tick=False, consider_reference_constant=True, return_reference=False, loss_function=<function squared_error>, scan_dependencies=(), scan_strict=True)[source]

Returns squared error between action values and reference (r+gamma*V(s’)) according to deterministic policy gradient.

This function can also be used for any model-based acceleration like Qlearning with normalized advantage functions.
  • Original article: http://arxiv.org/abs/1603.00748
  • Since you can provide any state_values, you can technically use any other advantage function shape
    as long as you can compute V(s).
If n_steps > 1, the algorithm will use n-step Temporal Difference updates
V_reference(state,action) = reward(state,action) + gamma*reward(state_1,action_1) + ... + gamma^n * V(state_n)
Parameters:
  • action_qvalues – [batch,tick,action_id] - predicted qvalues
  • state_values – [batch,tick] - predicted state values (aka qvalues for best actions)
  • rewards – [batch,tick] - immediate rewards for taking actions at given time ticks
  • is_alive – [batch,tick] - whether the given session is still active at the given tick. Defaults to always active. The default value of is_alive implies a simplified computation of the Q-learning loss.
  • n_steps – if an integer is given, uses the n-step TD algorithm. If 1 (default), this works exactly as normal TD. If None, rewards are propagated throughout the whole sequence of state-action pairs.
  • gamma_or_gammas – delayed reward discounts: a single value or array[batch,tick] (can broadcast dimensions).
  • crop_last – if True, zeroes out the loss at the final tick; if False, computes the loss against Qvalues_after_end.
  • state_values_after_end – [batch,1] - symbolic expression for the “best next state Q-values” after the last tick, used only when computing reference Q-values. Defaults to T.zeros_like(Q-values[:,0,None,0]). If you wish to simply ignore the last tick, use the defaults and crop the output’s last tick (qref[:,:-1]).
  • force_end_at_last_tick – if True, forces session end at the last tick unless it ended otherwise
  • consider_reference_constant – whether or not to zero out gradient flow through reference_qvalues (True is highly recommended)
  • return_reference – if True, returns reference Qvalues. If False, returns loss_function(action_Qvalues, reference_qvalues)
  • loss_function – loss_function(V_reference,V_predicted). Defaults to (V_reference-V_predicted)**2. Use to override squared error with different loss (e.g. Huber or MAE)
Returns:

elementwise squared error over Q-values (using the formula above for the loss)
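
A hedged sketch of the critic objective (illustrative names; action_qvalues_seq holds the predicted Q-values for the committed actions and V_seq the predicted state values):

    from agentnet.learning import dpg

    # elementwise critic loss: squared error between Q(s,a) and r + gamma*V(s')
    elwise_critic_loss = dpg.get_elementwise_objective_critic(
        action_qvalues=action_qvalues_seq,  # predicted Q-values for committed actions
        state_values=V_seq,                 # [batch,tick] predicted state values
        rewards=rewards_seq,                # [batch,tick]
        is_alive=is_alive_seq,              # [batch,tick]
        n_steps=1,
        gamma_or_gammas=0.99)

    critic_loss = elwise_critic_loss.sum(axis=1).mean()

The actor is then trained to maximize Q(s, policy(s)) with respect to the policy parameters, as in the DDPG article above.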

Implements the layers required to train Q-learning with normalized advantage functions (NAF). All the math is taken from the original article: http://arxiv.org/abs/1603.00748. The loss function is exactly the same as for deterministic policy gradient (agentnet.learning.dpg). Usage example: https://github.com/yandexdataschool/AgentNet/blob/master/examples/Continuous%20LunarLander%20%20using%20normalized%20advantage%20functions.ipynb

agentnet.learning.qlearning_naf.LowerTriangularLayer(incoming, matrix_diag=None, **kwargs)[source]
agentnet.learning.qlearning_naf.NAFLayer(action_layer, mean_layer, L_layer, **kwargs)[source]
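
As described in the article, these layers parameterize the Q-function through a quadratic advantage term:

Q(s,a) = V(s) + A(s,a)
A(s,a) = -1/2 * (a - mu(s))^T * P(s) * (a - mu(s)),  where P(s) = L(s) * L(s)^T

Here mu(s) is the deterministic policy (mean_layer), L(s) is the lower-triangular matrix produced by LowerTriangularLayer (L_layer), and a comes from action_layer. Since P(s) is positive semi-definite, the advantage is maximized at a = mu(s), which makes the greedy action available in closed form.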

Generic

Several helper functions used in various reinforcement learning algorithms.

agentnet.learning.generic.get_values_for_actions(values_for_all_actions, actions)[source]

Auxiliary function to select the policy/Q-values corresponding to the chosen actions.

Parameters:
  • values_for_all_actions – Q-values or similar for all actions: floatX[batch,tick,action]
  • actions – action ids: int32[batch,tick]
Returns:

values selected for the given actions: float[batch,tick]
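
Conceptually this is just advanced indexing along the action axis; a plain numpy illustration of the same selection (not the symbolic implementation):

    import numpy as np

    qvalues = np.random.randn(2, 5, 3)               # [batch, tick, action]
    actions = np.random.randint(0, 3, size=(2, 5))   # [batch, tick]

    batch_ix, tick_ix = np.indices(actions.shape)
    chosen_values = qvalues[batch_ix, tick_ix, actions]   # [batch, tick]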

agentnet.learning.generic.get_end_indicator(is_alive, force_end_at_t_max=False)[source]

Auxiliary function to transform a session alive indicator into an end action indicator.

Parameters:
  • force_end_at_t_max – if True, all sessions that didn’t end by the end of the recorded sessions are ended at the last recorded tick.
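
For intuition, assuming the indicator marks the last tick at which each session was still alive (a plain numpy illustration, not the symbolic implementation):

    import numpy as np

    is_alive = np.array([[1, 1, 1, 0, 0],    # session 0 terminates after tick 2
                         [1, 1, 1, 1, 1]])   # session 1 never terminates on record

    end_indicator = np.array([[0, 0, 1, 0, 0],
                              [0, 0, 0, 0, 0]])
    # with force_end_at_t_max=True, the second session would instead be ended
    # at the last recorded tick: [0, 0, 0, 0, 1]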

agentnet.learning.generic.get_n_step_value_reference(state_values, rewards, is_alive='always', n_steps=None, gamma_or_gammas=0.99, crop_last=True, state_values_after_end='zeros', end_at_tmax=False, force_n_step=False)[source]

Computes the reference for state value function via n-step TD algorithm:

Vref = r(t) + gamma*r(t+1) + gamma^2*r(t+2) + ... + gamma^n*V(s[t+n]) where n == n_steps

Used by all n_step methods, including Q-learning, a2c and dpg

Works with both Q-values and state values, depending on what you pass as state_values (see below).

Parameters:
  • state_values – float[batch,tick] - predicted state values V(s) at the given batch session and time tick. For Q-learning, this is the max over Q-values; for state-value based methods (a2c, dpg), it is the state values themselves.
  • rewards – float[batch,tick] - rewards achieved by committing actions at [batch,tick]
  • is_alive – whether the session is still active at given tick, int[batch_size,time] of ones and zeros
  • n_steps – if an integer is given, the references are computed in loops of n_steps: every n_steps’th step the reference is set to V = r + gamma * next V_predicted; on the other steps the reference is propagated as V = r + gamma * next V_reference. Defaults to None: rewards are propagated throughout the whole session. Widely known as “lambda” in the RL community (TD-lambda, Q-lambda), plus or minus one :) If n_steps equals 1, this works exactly as regular TD (though less efficiently). If you provide a symbolic integer here AND strict=True, make sure you added the variable to the dependencies.
  • gamma_or_gammas – delayed reward discount number, scalar or vector[batch_size]
  • crop_last – if True (default), ignores the loss for the last tick.
  • state_values_after_end – symbolic expression for the “next state values” after the last tick, used for the reference only. Defaults to T.zeros_like(values[:,0,None,:]). If you wish to simply ignore the last tick, use the defaults and crop the output’s last tick (qref[:,:-1]).

  • end_at_tmax – if True, forces session end at last tick if there was no other session end.
  • force_n_step – if True, does NOT fall back to 1-step algorithm if n_steps = 1
Returns:

V reference [batch,action_at_tick] according to the n-step algorithm (~ eligibility traces), e.g. as used for A3C and k-step Q-learning in http://arxiv.org/pdf/1602.01783.pdf, and for k-step advantage estimation in https://arxiv.org/pdf/1506.02438v5.pdf
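
For intuition, here is a non-symbolic sketch of the backup described above (plain numpy; it ignores is_alive and crop_last, and the exact restart phase may differ from the library's symbolic implementation):

    import numpy as np

    def n_step_value_reference(state_values, rewards, gamma=0.99, n_steps=3):
        """Vref[t] = r[t] + gamma * Vref[t+1], re-bootstrapped from V_predicted every n_steps ticks."""
        n_batch, seq_len = rewards.shape
        v_ref = np.zeros(rewards.shape)
        next_ref = np.zeros(n_batch)                    # 'zeros' after the last tick
        for t in reversed(range(seq_len)):
            if (seq_len - 1 - t) % n_steps == 0 and t + 1 < seq_len:
                next_ref = state_values[:, t + 1]       # cut the trace: bootstrap from predictions
            v_ref[:, t] = rewards[:, t] + gamma * next_ref
            next_ref = v_ref[:, t]
        return v_ref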

agentnet.learning.generic.get_1_step_value_reference(state_values, rewards, is_alive='always', gamma_or_gammas=0.99, crop_last=True, state_values_after_end='zeros', end_at_tmax=False)[source]

Computes the reference for state value function via 1-step TD algorithm:

Vref = r(t) + gamma*V(s’)

Used as a fallback by the n-step algorithm when n_steps=1 (for performance reasons).

Parameters:
  • state_values – float[batch,tick] - predicted state values V(s) at the given batch session and time tick. For Q-learning, this is the max over Q-values; for state-value based methods (a2c, dpg), it is the state values themselves.
  • rewards – float[batch,tick] - rewards achieved by committing actions at [batch,tick]
  • is_alive – whether the session is still active int/bool[batch_size,time]
  • gamma_or_gammas – delayed reward discount number, scalar or vector[batch_size]
  • crop_last – if True (default), ignores the loss for the last tick.
  • state_values_after_end – symbolic expression for the “next state values” after the last tick, used for the reference only. Defaults to T.zeros_like(values[:,0,None,:]). If you wish to simply ignore the last tick, use the defaults and crop the output’s last tick (qref[:,:-1]).

  • end_at_tmax – if True, forces session end at last tick if there was no other session end.
Returns:

V reference [batch,action_at_tick] = r + gamma*V(s’)