# Environment¶

Environment is an MDP abstraction that defines which observations the agent gets and how the environment's external state changes given the agent's actions and the previous state.

There’s a base class for environment definition, as well as special environments for Experience Replay.

• if the environment logic can be implemented entirely in theano, inherit from BaseEnvironment. See ./experiments/wikicat or boolean_reasoning for examples.
• if it can't (which is usually the case), use SessionPoolEnvironment to train from recorded interactions, as in the Atari examples

## BaseEnvironment¶

class agentnet.environment.BaseEnvironment(state_shapes, observation_shapes, action_shapes, state_dtypes=None, observation_dtypes=None, action_dtypes=None)[source]

Base for environment layers. This is the class you want to inherit from when designing your own custom environments.

To define an environment, one has to describe
• its internal state(s),
• the observations it sends to the agent,
• the actions it accepts from the agent,
• the environment's inner logic.

States, observations and actions are theano tensors (matrices, vectors, etc.); their shapes and dtypes should be defined via state_shapes, state_dtypes, action_shapes, etc.

The default dtypes are floatX for states and observations and int32 for actions. This suits most cases, so one can usually keep the inherited dtypes.

Finally, one has to implement get_action_results, which maps a tuple of (old environment state, agent action) to (new state, new observation).

Developer tips [when playing with non-float observations and states]: if you implemented a new environment but keep getting a _.grad illegally returned an integer-valued variable exception (Input index _, dtype _), make sure that any non-float environment states are either excluded from gradient computation or cast to floatX.

To find out which variable causes the problem, find all expressions of the dtype mentioned in the error and iteratively replace their type with a similar one (e.g. int8 -> uint8 or int32) until the error message's dtype changes. Once it does, you have found the cause of the exception.

get_action_results(last_states, actions, **kwargs)[source]

Computes environment state after processing agent’s action.

An example implementation:

```python
# a dummy update rule where the new state equals the last state
new_states = prev_states
# an MDP with full observability: the agent observes the state itself
observations = new_states
return new_states, observations
```

Parameters:
• last_states (list(float[batch_id, memory_id0, [memory_id1], ...])) – environment states on the previous tick.
• actions (list(int[batch_id])) – agent actions after observing the last state.

Returns: a tuple of (new_states, observations)
• new_states (list(float[batch_id, memory_id0, [memory_id1], ...])) – environment states after processing the agent's action.
• observations (list(float[batch_id, n_agent_inputs])) – what the agent observes after committing its last action.
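The get_action_results contract can be sketched outside theano as a plain function over numpy arrays. This is a toy fully-observable MDP of our own invention, not part of the AgentNet API; in a real environment the bodies would be theano expressions:

```python
import numpy as np

def get_action_results(last_states, actions):
    # toy transition: each state accumulates the (float-cast) chosen action
    new_states = [state + actions[0].astype('float32') for state in last_states]
    # full observability: the agent observes the new state directly
    observations = [state.copy() for state in new_states]
    return new_states, observations

# one float state of shape [batch], one int action of shape [batch]
states, obs = get_action_results([np.zeros(3, dtype='float32')],
                                 [np.array([1, 2, 3], dtype='int32')])
```

Note that both inputs and both outputs are lists of tensors, one entry per state/observation, even when there is only one of each.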
as_layers(prev_state_layers=None, action_layers=None, environment_layer_name='EnvironmentLayer')[source]

Lasagne Layer compatibility method. Understanding this is not required when implementing your own environments.

Creates a lasagne layer that performs one-step environment updates given agent actions.

Parameters:
• prev_state_layers – a layer or a list of layers that provide the previous environment state. None means InputLayers are created automatically.
• action_layers – a layer or a list of layers that provide the agent's chosen action. None means InputLayers are created automatically.
• environment_layer_name (str) – the layer's name.

Returns: [new_states], [observations] – two lists of lasagne layers:
• new_states – all states, in the same order as in self.state_shapes
• observations – all observations, in the order of self.observation_shapes

## Experience Replay¶

class agentnet.environment.SessionPoolEnvironment(observations=1, actions=1, agent_memories=1, default_action_dtype='int32', rng_seed=1337)[source]

A generic pseudo-environment that replays sessions loaded via .load_sessions(...), ignoring agent actions completely.

This environment can be used either as a tool to run experiments with non-theano environments or to actually train via experience replay [http://rll.berkeley.edu/deeprlworkshop/papers/database_composition.pdf]

It has a single scalar integer env_state, corresponding to the time tick.

The environment maintains its own pool of sessions, represented as (.observations, .actions, .rewards).

To load sessions into the pool, use
• .load_sessions - replace existing sessions with new ones
• .append_sessions - add new sessions to the existing ones, up to a size limit
• .get_session_updates - get a symbolic update of the experience replay pool via theano updates.

To use SessionPoolEnvironment for experience replay, one can
• feed it into agent.get_sessions (optimize_experience_replay=True is recommended) to use all sessions
• subsample sessions via .select_session_batch or .sample_session_batch to use a random subsample of sessions (these create a SessionBatchEnvironment that can be used with agent.get_sessions)
During experience replay sessions,
• states are replaced with a fake one-unit state
• observations, actions and rewards match the original ones
• agent memory states, Q-values and all in-agent expressions (except for actions) correspond to what the agent thinks NOW about the replay.

Although it is possible to get rewards via the regular functions, it is usually faster to take self.rewards as rewards with no additional computation.
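The pool's structure can be mimicked with plain numpy arrays. This is a sketch of the data layout only (the class name is ours; the real pool lives in theano shared variables):

```python
import numpy as np

class ToySessionPool(object):
    """Minimal stand-in for the replay pool; all arrays are [batch, tick, ...]."""

    def __init__(self):
        self.observations = self.actions = self.rewards = None

    def load_sessions(self, observation_seq, action_seq, reward_seq):
        # replace the whole pool with the new batch of recorded sessions
        self.observations = np.asarray(observation_seq)  # [batch, tick, n_inputs]
        self.actions = np.asarray(action_seq)            # [batch, tick]
        self.rewards = np.asarray(reward_seq)            # [batch, tick]

# 4 recorded sessions of 10 ticks each, with 3 flat observation features
pool = ToySessionPool()
pool.load_sessions(np.zeros((4, 10, 3)),
                   np.zeros((4, 10), dtype='int32'),
                   np.zeros((4, 10)))
```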

Parameters:
• observations (int or lasagne.layers.Layer or list of lasagne.layers.Layer) – number of flat floatX observations, or a list of observation inputs to mimic
• actions (int or lasagne.layers.Layer or list of lasagne.layers.Layer) – number of int32 scalar actions, or a list of resolvers to mimic
• agent_memories (int or lasagne.layers.Layer or list of lasagne.layers.Layer) – number of agent states ([batch, tick, unit] each), or a list of memory layers to mimic
• default_action_dtype (string or dtype) – the default dtype of actions; if actions are given as lasagne layers with valid dtypes, agentnet.utils.layers.get_layer_dtype is used on a per-layer basis instead

To set up a custom dtype, set the .output_dtype property of the layers you pass as actions, observations or memories.

WARNING! This session pool is stored entirely as a set of theano shared variables. GPU users who want to store a __large__ pool of sessions to sample from are advised to keep them somewhere outside (e.g. as numpy arrays) to avoid overloading GPU memory.

load_sessions(observation_sequences, action_sequences, reward_seq, is_alive=None, prev_memories=None)[source]

Load a batch of sessions into the environment. The loaded sessions are the ones used during agent interactions.

append_sessions(observation_sequences, action_sequences, reward_seq, is_alive=None, prev_memories=None, max_pool_size=None)[source]

Add a batch of sessions to the existing ones. The loaded sessions are the ones used during agent interactions.

If max_pool_size is not None, only the last max_pool_size sessions are kept.
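The max_pool_size behaviour amounts to concatenating along the batch (session) axis and keeping only the most recent sessions. A numpy sketch of the semantics (not the actual theano implementation):

```python
import numpy as np

def append_sessions(pool, new_sessions, max_pool_size=None):
    """Concatenate new sessions onto the pool, then truncate to the last max_pool_size."""
    pool = np.concatenate([pool, new_sessions], axis=0)
    if max_pool_size is not None:
        pool = pool[-max_pool_size:]   # keep only the most recent sessions
    return pool

rewards = np.zeros((8, 10))            # 8 old sessions of 10 ticks, all zero reward
rewards = append_sessions(rewards, np.ones((5, 10)), max_pool_size=10)
# the pool now holds the last 10 sessions: 5 old zero-reward ones and the 5 new ones
```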

get_session_updates(observation_sequences, action_sequences, reward_seq, is_alive=None, prev_memory=None, cast_dtypes=True)[source]

Returns a dictionary of updates that will set the shared variables to the argument state. If cast_dtypes is True, all updates are cast to the dtypes of their respective variables.
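The dtype-casting step can be sketched with numpy, standing in arrays for shared variables (anything with a .dtype). The function name and dict keying are illustrative only; the real method returns theano updates keyed by the shared variables themselves:

```python
import numpy as np

def make_updates(shared_vars, new_values, cast_dtypes=True):
    """Pair each variable with its new value, casting to the variable's dtype."""
    updates = {}
    for var, value in zip(shared_vars, new_values):
        value = np.asarray(value)
        if cast_dtypes and value.dtype != var.dtype:
            value = value.astype(var.dtype)  # e.g. float64 rewards -> float32 storage
        updates[id(var)] = value             # keyed by id here; arrays aren't hashable
    return updates

storage = np.zeros((2, 3), dtype='float32')
updates = make_updates([storage], [np.ones((2, 3))])   # float64 input gets cast down
```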

select_session_batch(selector)[source]

Returns SessionBatchEnvironment with sessions (observations, actions, rewards) from pool at given indices.

Parameters: selector – an array of integers containing the indices of the sessions to take.

Note that if this environment did not load is_alive or preceding_memory, you won't be able to use them in the resulting SessionBatchEnvironment.

sample_session_batch(max_n_samples, replace=False, selector_dtype='int32')[source]

Return SessionBatchEnvironment with sessions (observations, actions, rewards) that will be sampled uniformly from this session pool.

If replace=False, the number of samples is min(max_n_samples, current pool size); otherwise it equals max_n_samples.

The chosen session ids are sampled at random using self.rng on each iteration. There is no need to propagate rng updates manually; the method handles them itself. The exception is calling it inside theano.scan, which is not recommended (unroll_scan, on the other hand, should work fine).
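The sampling rule can be sketched in numpy (1337 matches the default rng_seed; the real method keeps a persistent self.rng and returns a SessionBatchEnvironment, while here we only pick the indices):

```python
import numpy as np

def sample_batch_indices(pool_size, max_n_samples, replace=False, seed=1337):
    """Uniformly pick session indices; without replacement the draw is capped by the pool size."""
    rng = np.random.RandomState(seed)
    n = max_n_samples if replace else min(max_n_samples, pool_size)
    return rng.choice(pool_size, size=n, replace=replace)

# asking for 8 samples from a pool of 5 without replacement yields all 5, once each
idx = sample_batch_indices(pool_size=5, max_n_samples=8)
```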

class agentnet.environment.SessionBatchEnvironment(observations, single_observation_shapes, actions=None, single_action_shapes='all_scalar', rewards=None, is_alive=None, preceding_agent_memories=None)[source]

A generic pseudo-environment that replays sessions defined at creation time by theano expressions, ignoring agent actions completely.

The environment takes symbolic expressions for sessions, represented as (.observations, .actions, .rewards). Unlike SessionPoolEnvironment, it does not store its own pool of sessions.

To create experience-replay sessions, call Agent.get_sessions with this as an environment.

Parameters:
• observations (theano tensor or a list of such) – a tensor or a list of tensors matching the agent's observation sequences [batch, tick, ...]
• single_observation_shapes – shapes of one-tick, one-batch-item observations. E.g. if the lasagne shape is [None, 25 (ticks), 3, 210, 160], then single_observation_shapes must contain [3, 210, 160]
• actions (theano tensor or a list of such) – a tensor or a list of tensors matching the agent's action sequences [batch, tick, ...]
• single_action_shapes – shapes of one-tick, one-batch-item actions, similar to observations. 'all_scalar' means each action is a scalar, the lasagne sequence layer being of shape (None, seq_length)
• rewards (theano tensor) – a tensor matching the agent's reward sequence [batch, tick]
• is_alive (theano tensor or None) – whether the session has not yet finished by a particular tick. Always alive by default.
• preceding_agent_memories – a tensor or a list of such, storing what was in the agent's memory prior to the first tick of the replay session.
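The relation between full sequence shapes and the single_*_shapes parameters amounts to dropping the batch and tick axes. A small helper of our own (not part of the API) makes the rule explicit:

```python
def single_shape(sequence_shape):
    """Drop the batch and tick axes from a [batch, tick, ...] output shape."""
    return tuple(sequence_shape[2:])

# e.g. Atari frames: a sequence layer of shape (None, 25, 3, 210, 160)
frame_shape = single_shape((None, 25, 3, 210, 160))   # -> (3, 210, 160)
scalar_action_shape = single_shape((None, 20))        # 'all_scalar' case -> ()
```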

How does it tick:

During experience replay sessions,
• observations, actions and rewards match the original ones
• agent memory states, Q-values and all in-agent expressions (except for actions) correspond to what the agent thinks NOW about the replay (not what it thought when it committed those actions)
• preceding_agent_memories [optional] holds the agent's memory state prior to the first tick of the replay session.

Although it is possible to get rewards via the regular functions, it is usually faster to take self.rewards as rewards with no additional computation.