Memory layers

This module contains a number of Lasagne layers useful when designing agent memory.

Memory layers can be roughly divided into classical recurrent layers (e.g. RNN, LSTM, GRU) and augmentations (stack augmentation, window augmentation, etc.).

Technically, memory layers are Lasagne layers that take the previous memory state plus some optional inputs and return the new memory state.

For example, to build an RNN with 36 units, one has to define:

    # RNN input
    rnn_input = some_lasagne_layer

    # RNN state from the previous tick
    prev_rnn = InputLayer((None, 36))  # None stands for the batch size

    # new RNN state, i.e. sigma(Wi * rnn_input + Wh * prev_rnn + b)
    new_rnn = RNNCell(prev_rnn, rnn_input)

When used inside an Agent (MDPAgent) or a Recurrence, memory layers must be registered as agent_states (for Agent) or state_variables (for Recurrence), e.g.

    from agentnet.agent import Agent
    agent = Agent(observations, {new_rnn: prev_rnn}, ...)

Standard recurrent layers

agentnet.memory.RNNCell(prev_state, input_or_inputs=(), nonlinearity=<function tanh>, num_units=None, name=None, grad_clipping=0, Whid=<lasagne.init.Uniform object>, Winp=<lasagne.init.Uniform object>, b=<lasagne.init.Constant object>)[source]

Implements a one-step recurrent neural network (RNN) with an arbitrary number of units.

Parameters:
  • prev_state – input that denotes previous state (shape must be (None, n_units) )
  • input_or_inputs – a single layer or a list/tuple of layers that go as inputs
  • nonlinearity – which nonlinearity to use
  • num_units – how many recurrent cells to use. None means “as in prev_state”
  • grad_clipping – maximum gradient absolute value. 0 or None means “no clipping”
Returns:

updated memory layer

Return type:

lasagne.layers.Layer

For developers:
Works by stacking DenseLayers with an ElemwiseSumLayer. RNNCell is a function that assembles these layers, not an actual layer class.
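For instance, a minimal sketch of one RNN step that mixes two inputs and clips gradients (the layer names and sizes here are illustrative):

    from lasagne.layers import InputLayer
    from agentnet.memory import RNNCell

    # two illustrative inputs: observation features and the previous action
    obs_features = InputLayer((None, 128))
    prev_action = InputLayer((None, 8))

    # previous RNN state; None stands for the batch size
    prev_rnn = InputLayer((None, 64))

    # new RNN state fed by both inputs, with gradients clipped at 5
    new_rnn = RNNCell(prev_rnn, [obs_features, prev_action], grad_clipping=5)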
agentnet.memory.GRUCell(prev_state, input_or_inputs=(), num_units=None, weight_init=<lasagne.init.Normal object>, bias_init=<lasagne.init.Constant object>, forgetgate_nonlinearity=<function sigmoid>, updategate_nonlinearity=<function sigmoid>, hidden_update_nonlinearity=<function tanh>, dropout=0, name='YetAnotherGRULayer', grad_clipping=0)[source]

Implements a one-step gated recurrent unit (GRU) with an arbitrary number of units.

Parameters:
  • prev_state (lasagne.layers.Layer) – input that denotes previous state (shape must be (None, n_units) )
  • input_or_inputs (lasagne.layers.Layer or list of such) – a single layer or a list/tuple of layers that go as inputs
  • num_units (int) – how many recurrent cells to use. None means “as in prev_state”
  • weight_init – either a single lasagne initializer to use for all gate weights, or a list of two elements:
    • the first is used for all weights from hidden -> <any>_gate and hidden update
    • the second is used for all weights from input(s) -> <any>_gate and hidden update
    Each of the two elements may itself be a list:
    • the first list holds initializers for hidden -> [forget gate, update gate, hidden update]
    • the second is a list of lists where list[i][0,1,2] = input[i] -> [forget gate, update gate, hidden update]
  • <any>_nonlinearity – which nonlinearity to use for a particular gate
  • dropout – dropout rate as per https://arxiv.org/abs/1603.05118
  • grad_clipping – maximum gradient absolute value. 0 or None means “no clipping”
Returns:

updated memory layer

Return type:

lasagne.layers.Layer

For developers:
Works by stacking other Lasagne layers; GRUCell is a function that assembles these layers, not an actual layer class.
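A minimal usage sketch, analogous to the RNN example above (layer names and sizes are illustrative):

    from lasagne.layers import InputLayer
    from agentnet.memory import GRUCell

    gru_input = InputLayer((None, 128))   # some observation features
    prev_gru = InputLayer((None, 256))    # GRU state from the previous tick

    # new GRU state; num_units defaults to the size of prev_gru
    new_gru = GRUCell(prev_gru, gru_input)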
agentnet.memory.LSTMCell(prev_cell, prev_out, input_or_inputs=(), num_units=None, peepholes=True, weight_init=<lasagne.init.Normal object>, bias_init=<lasagne.init.Constant object>, peepholes_W_init=<lasagne.init.Normal object>, forgetgate_nonlinearity=<function sigmoid>, inputgate_nonlinearity=<function sigmoid>, outputgate_nonlinearity=<function sigmoid>, cell_nonlinearity=<function tanh>, output_nonlinearity=<function tanh>, dropout=0.0, name=None, grad_clipping=0.0)[source]

Implements a one-step LSTM update. Note that an LSTM requires both c_t (the "private" cell memory) and h_t (the "public" output).

Parameters:
  • prev_cell (lasagne.layers.Layer) – input that denotes previous “private” state (shape must be (None, n_units) )
  • prev_out (lasagne.layers.Layer) – input that denotes previous “public” state (shape must be (None,n_units))
  • input_or_inputs (lasagne.layers.Layer or list of such) – a single layer or a list/tuple of layers that go as inputs
  • num_units (int) – how many recurrent cells to use. None means “as in prev_state”
  • peepholes (bool) – If True, the LSTM uses peephole connections. When False, peepholes_W_init are ignored.
  • bias_init – either a single lasagne initializer to use for all gate biases, or a list of 4 initializers for [input gate, forget gate, cell, output gate]
  • weight_init – either a single lasagne initializer to use for all gate weights, or a list of two elements:
    • the first is used for all weights from hidden -> <all>_gate and cell
    • the second is used for all weights from input(s) -> <all>_gate and cell
    Each of the two elements may itself be a list:
    • the first list holds initializers for hidden -> [input gate, forget gate, cell, output gate]
    • the second is a list of lists where list[i][0,1,2,3] = input[i] -> [input gate, forget gate, cell, output gate]
  • peepholes_W_init – either a lasagne initializer or a list of 3 initializers for [input gate, forget gate, output gate] weights. If peepholes=False, this is ignored.
  • <any>_nonlinearity – which nonlinearity to use for a particular gate
  • dropout – dropout rate as per https://arxiv.org/pdf/1603.05118.pdf
  • grad_clipping – maximum gradient absolute value. 0 or None means “no clipping”
Returns:

a tuple of (new_cell,new_output) layers

Return type:

(lasagne.layers.Layer,lasagne.layers.Layer)

For developers:
Works by stacking other Lasagne layers; LSTMCell is a function that assembles these layers, not an actual layer class.
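Since LSTMCell returns two layers, both states have to be tracked across ticks. A minimal sketch (layer names and sizes are illustrative; the state-registration comment follows the Agent convention from the introduction above):

    from lasagne.layers import InputLayer
    from agentnet.memory import LSTMCell

    lstm_input = InputLayer((None, 128))   # some observation features
    prev_cell = InputLayer((None, 256))    # previous c_t, the "private" memory
    prev_out = InputLayer((None, 256))     # previous h_t, the "public" output

    # one LSTM step returns a tuple of (new_cell, new_output)
    new_cell, new_out = LSTMCell(prev_cell, prev_out, lstm_input)

    # when used inside Agent/Recurrence, register both states, e.g.
    # agent_states = {new_cell: prev_cell, new_out: prev_out}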

Augmentations

agentnet.memory.AttentionLayer(input_sequence, query, num_units, mask_input=None, key_sequence=None, nonlinearity=<theano.tensor.elemwise.Elemwise object>, probs_nonlinearity=<function softmax>, W_enc=<lasagne.init.Normal object>, W_dec=<lasagne.init.Normal object>, W_out=<lasagne.init.Normal object>, **kwargs)[source]

A layer that implements basic Bahdanau-style attention. Implementation is inspired by tfnn@yandex.

In short, attention lets the network decide which part of a sequence/image it should look at right now, using a small one-layer block that predicts, for every input element, how much the network wants to see that element given the (input_element, query) pair. You can read more about it here: http://distill.pub/2016/augmented-rnns/ .

AttentionLayer also allows you to have separate keys and values: it computes logits with keys, then converts them to weights(probs) and averages _values_ with those weights.

This layer outputs a dict with keys "attn" and "probs":
  • attn – inputs processed with attention, shape [batch_size, enc_units]
  • probs – probabilities for each activation, shape [batch_size, seq_length]

This layer assumes the input sequence/image/video/whatever to have 1 spatial dimension (see below).
  • rnn/emb format [batch,seq_len,units] works out of the box
  • 1d convolution format [batch,units,seq_len] needs dimshuffle(conv,[0,2,1])
  • 2d convolution format [batch,units,dim1,dim2] needs a two-step procedure:
    • step1 = dimshuffle(conv,[0,2,3,1])
    • step2 = reshape(step1,[-1,dim1*dim2,units])
  • higher dimensionality follows the same principle as the 2d example above
  • reshape and dimshuffle can both be found in lasagne.layers (aliases to ReshapeLayer and DimshuffleLayer)

When calling get_output, you can pass flag hard_attention=True to replace attention with argmax over logits.

Parameters:
  • input_sequence (lasagne.layers.Layer with shape [batch,seq_length,units]) – sequence of inputs to be processed with attention
  • query (lasagne.layers.Layer with shape [batch,units]) – single time-step state of decoder (usually lstm/gru/rnn hid)
  • num_units (int) – number of hidden units in attention intermediate activation
  • key_sequence (lasagne.layers.Layer with shape [batch,seq_length,units] or None) – a sequence of keys to compute dot_product with. By default, uses input_sequence instead.
  • nonlinearity (function(x) -> x that works with theano tensors) – nonlinearity in attention intermediate activation
  • probs_nonlinearity (function(x) -> x that works with theano tensors) – nonlinearity that converts logits of shape [batch,seq_length] into attention weights of same shape (you can provide softmax with tunable temperature or gumbel-softmax or anything of the sort)
  • mask_input (lasagne.layers.Layer with shape [batch,seq_length]) – mask for input_sequence (like other lasagne masks). Default is no mask

The remaining params (W_enc, W_dec, W_out) can each be a theano shared variable, an expression, a numpy array or a callable: an initial value, expression or initializer for the weights. Each should be a matrix with shape (num_inputs, num_units). See lasagne.utils.create_param() for more information.

The roles of those params are:
  • W_enc – weights from the encoder (each state) to the hidden layer
  • W_dec – weights from the decoder (each state) to the hidden layer
  • W_out – hidden-to-logit weights
No logit biases are introduced because softmax is invariant to adding a bias to each logit.
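A minimal sketch of attaching attention to a decoder state (layer names and sizes are illustrative; indexing the two outputs by key assumes agentnet's dict-layer convention):

    from lasagne.layers import InputLayer
    from agentnet.memory import AttentionLayer

    # encoder states: [batch, seq_length, units]
    enc_sequence = InputLayer((None, 20, 128))
    # current decoder state: [batch, units]
    dec_state = InputLayer((None, 256))

    attention = AttentionLayer(enc_sequence, dec_state, num_units=64)

    # assumed dict-layer indexing for the two outputs
    attn_readout = attention['attn']   # [batch, 128]
    attn_probs = attention['probs']    # [batch, 20]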

agentnet.memory.DotAttentionLayer(input_sequence, query, key_sequence=None, mask_input=None, scale=False, use_dense_layer=False, probs_nonlinearity=<function softmax>, **kwargs)[source]

A layer that implements multiplicative (Dotproduct) attention. Implementation is inspired by tfnn@yandex.

Unlike AttentionLayer, DotAttentionLayer expects the query to have the same shape as one time-step of the input sequence. If it does not, the query is first projected to that shape with a DenseLayer (see the query parameter below).

DotAttention also allows you to have separate keys and values: it computes logits with keys, then converts them to weights(probs) and averages _values_ with those weights.

In short, attention lets the network decide which part of a sequence/image it should look at right now, using a small one-layer block that predicts, for every input element, how much the network wants to see that element given the (input_element, query) pair. You can read more about it here: http://distill.pub/2016/augmented-rnns/ .

This layer outputs a dict with keys "attn" and "probs":
  • attn – inputs processed with attention, shape [batch_size, enc_units]
  • probs – probabilities for each activation, shape [batch_size, seq_length]

This layer assumes the input sequence/image/video/whatever to have 1 spatial dimension (see below).
  • rnn/emb format [batch,seq_len,units] works out of the box
  • 1d convolution format [batch,units,seq_len] needs dimshuffle(conv,[0,2,1])
  • 2d convolution format [batch,units,dim1,dim2] needs a two-step procedure:

  • step1 = dimshuffle(conv,[0,2,3,1])
  • step2 = reshape(step1,[-1,dim1*dim2,units])
  • higher dimensionality follows the same principle as 2d example above
  • reshape and dimshuffle can both be found in lasagne.layers (aliases to ReshapeLayer and DimshuffleLayer)

When calling get_output, you can pass flag hard_attention=True to replace attention with argmax over logits.

Parameters:
  • input_sequence (lasagne.layers.Layer with shape [batch,seq_length,units]) – sequence of inputs to be processed with attention
  • query (lasagne.layers.Layer with shape [batch,units]) – single time-step state of decoder that is used as query (usually custom layer or lstm/gru/rnn hid) If it matches input_sequence one-step size, query is used as is. Otherwise, DotAttention is performed from DenseLayer(query,input_units,nonlinearity=None).
  • key_sequence (lasagne.layers.Layer with shape [batch,seq_length,units] or None) – a sequence of keys to compute dot_product with. By default, uses input_sequence instead.
  • mask_input (lasagne.layers.Layer with shape [batch,seq_length]) – mask for input_sequence (like other lasagne masks). Default is no mask
  • scale – if True, scales query.dot(key) by key_size**-0.5 to maintain variance. Otherwise does nothing.
  • use_dense_layer – if True, forcibly creates intermediate dense layer on top of query
  • probs_nonlinearity (function(x) -> x that works with theano tensors) – nonlinearity that converts logits of shape [batch,seq_length] into attention weights of same shape (you can provide softmax with tunable temperature or gumbel-softmax or anything of the sort)
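A minimal sketch with separate keys and values and scaled dot-product logits (layer names and sizes are illustrative):

    from lasagne.layers import InputLayer
    from agentnet.memory import DotAttentionLayer

    values = InputLayer((None, 20, 128))   # sequence to be averaged (the values)
    keys = InputLayer((None, 20, 128))     # sequence used to compute the logits (the keys)
    query = InputLayer((None, 128))        # matches one time-step of the input, so no DenseLayer is added

    dot_attention = DotAttentionLayer(values, query, key_sequence=keys, scale=True)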
agentnet.memory.StackAugmentation(observation_input, prev_state_input, controls_layer, **kwargs)[source]

A special kind of memory augmentation that implements an end-to-end differentiable stack in accordance with this paper: http://arxiv.org/abs/1503.01007

Parameters:
  • observation_input (lasagne.layers.Layer) – an item that can be pushed into the stack (e.g. RNN state)
  • prev_state_input (lasagne.layers.Layer (usually InputLayer)) – previous stack state of shape [batch, stack depth, stack item size]
  • controls_layer (lasagne.layers.layer (usually DenseLayer with softmax nonlinearity)) – a layer with 3 channels: PUSH_OP, POP_OP and NO_OP accordingly (must sum to 1)

A simple snippet that builds this augmentation, from the Stack RNN example:

    stack_width = 3
    stack_depth = 50

    # previous stack goes here
    prev_stack_layer = InputLayer((None, stack_depth, stack_width))

    # stack controls - push, pop and no-op
    stack_controls_layer = DenseLayer(<rnn>, 3,
                                      nonlinearity=lasagne.nonlinearities.softmax)

    # stack input
    stack_input_layer = DenseLayer(<rnn>, stack_width)

    # new stack state
    next_stack = StackAugmentation(stack_input_layer, prev_stack_layer, stack_controls_layer)

agentnet.memory.WindowAugmentation(new_value_input, prev_state_input, **kwargs)[source]

An augmentation that keeps the K previous items in memory, as used in the DeepMind Atari architecture from the original article.

Keeps a window of the K last values of new_value_input. Each time a new element is pushed into the window, the oldest one is dropped.

Parameters:
  • new_value_input (lasagne.layers.Layer, shape must be compatible with prev_state_input (see next)) – a newest item to be stored in the window
  • prev_state_input (lasagne.layers.Layer, normally InputLayer) – previous window state of shape [batch, window_length, item_size]

Shapes of new_value_input and prev_state_input must match: if new_value_input has shape (batch, a, b, c), then prev_state_input must have shape (batch, window_size, a, b, c),

where a, b, c stand for arbitrary dimensions (e.g. channel, width and height of an image). An item can have any number of dimensions as long as the first one is the batch size.

Window shape and K are defined as prev_state_input.output_shape

The state is indexed as (batch_i, relative_time_inverted, new_value shape). So the last inserted value is at state[:,0], the one before it at state[:,1], and so on.

And yes, K = prev_state_input.output_shape[1].
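A minimal sketch that keeps the last 4 observations (shapes are illustrative):

    from lasagne.layers import InputLayer
    from agentnet.memory import WindowAugmentation

    # current observation, e.g. a single 84x84 image: [batch, 84, 84]
    new_frame = InputLayer((None, 84, 84))

    # previous window of the K=4 last frames: [batch, 4, 84, 84]
    prev_window = InputLayer((None, 4, 84, 84))

    # new window state; new_frame ends up at next_window[:, 0]
    next_window = WindowAugmentation(new_frame, prev_window)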

agentnet.memory.CounterLayer(prev_counter, k=None, name=None)[source]

A simple counter layer that increments its state by 1 each turn and wraps around every k iterations.

Parameters:
  • prev_counter (lasagne.layers.Layer, normally InputLayer) – previous state of counter
  • k – if not None, resets the counter to zero every k timesteps
Returns:

incremented counter

Return type:

lasagne.layers.Layer
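A minimal sketch of a counter that resets every 10 ticks (the (None, 1) state shape used here is an assumption of this sketch, not something spelled out above):

    from lasagne.layers import InputLayer
    from agentnet.memory import CounterLayer

    # previous counter state; the (None, 1) shape is assumed for illustration
    prev_counter = InputLayer((None, 1))

    # counter that increments every tick and resets to zero every 10 steps
    next_counter = CounterLayer(prev_counter, k=10)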

agentnet.memory.SwitchLayer(condition, than_branch, else_branch, name=None)[source]

A simple layer that implements if-then-else logic.

Parameters:
  • condition (lasagne.layers.Layer) – a layer with [batch_size] boolean conditions (dtype int*)
  • than_branch – the branch taken when condition != 0 for a particular element of the batch
  • else_branch – the branch taken when condition == 0 for a particular element of the batch

Shapes and dtypes of the two branches must match.

Returns: a layer where the i-th batch sample takes the than_branch value if its condition holds, and the else_branch value otherwise
Return type: lasagne.layers.Layer
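A minimal sketch choosing between two equally shaped branches per batch sample (layer names and sizes are illustrative):

    from lasagne.layers import InputLayer
    from agentnet.memory import SwitchLayer

    condition = InputLayer((None,))    # [batch_size] integer conditions
    branch_a = InputLayer((None, 32))  # used where condition != 0
    branch_b = InputLayer((None, 32))  # used where condition == 0

    switched = SwitchLayer(condition, branch_a, branch_b)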

Low-level layers

agentnet.memory.GateLayer(gate_controllers, channels, gate_nonlinearities=<function sigmoid>, bias_init=<lasagne.init.Constant object>, weight_init=<lasagne.init.Normal object>, channel_names=None, **kwargs)[source]

An overly generic interface for a one-step gate, stacked gates, or a gate applier. If several channels are given, they are stacked for quicker execution.

Parameters:
  • gate_controllers – a single layer or a list/tuple of such layers that gate depends on (for most RNNs, that’s input and previous memory state)
  • channels – a single layer or integer, or a list/tuple of layers/integers:
    • if a layer, it defines a layer that will be multiplied by the gate output
    • if an integer, it defines the number of units of a gate, and those units are what gets returned
  • gate_nonlinearities – a single function or a list of functions (channel-wise) defining the nonlinearities for the gates on the corresponding channels
  • bias_init
    • an initializer or a list (channel-wise) of initializers for bias(b) parameters
    • (None, lasagne.init, theano variable or numpy array)
    • None means no bias
  • weight_init
    • an initializer OR a list of initializers (channel-wise)
    • OR a list of lists of initializers (channel, controller)
    • (lasagne.init, theano variable or numpy array)
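A minimal sketch that gates a single candidate channel with a sigmoid gate computed from two controllers (layer names and sizes are illustrative; the exact return form for multiple channels is not spelled out above, so only one channel is used here):

    from lasagne.layers import InputLayer
    from agentnet.memory import GateLayer

    prev_state = InputLayer((None, 64))   # gate controller 1
    new_input = InputLayer((None, 128))   # gate controller 2
    candidate = InputLayer((None, 64))    # channel to be multiplied by the gate output

    # sigmoid gate computed from both controllers, applied to the candidate channel
    gated = GateLayer([prev_state, new_input], candidate)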