Lisheng Wu

A Trend Of Moving Deep Learning Towards Multi-Agent Setting

2019-12-24T00:00:00-08:00

Learn Shared Dynamics with Meta-World Models

2018-09-07T00:00:00-07:00

We often think of features in terms of appearance. Sometimes we also try to learn features that are tied to context, as word embeddings do: words with similar properties end up close to each other in the embedding space. Now consider reinforcement learning (RL) environment states from the same angle. Suppose two world environments have different state spaces $\mathcal{S}_1$ and $\mathcal{S}_2$ but share the same action space $\mathcal{A}$, and the agent physically interacts with both environments in the same way. We then say that the two worlds have the same dynamics.

For example, you and your reflection in the mirror have the same dynamics, but the observations are flipped horizontally. However, once you recognise that the reflection in the mirror is yourself, you start to unify the representations of yourself and the reflection. This example can also be read as an illustration of the mirror test, which is taken to show the existence of self-consciousness.

To explore the same process with machine learning, we propose using the shared dynamics to learn a meta-world model. The meta-world model is based on the world model architecture, which consists of a Vision ($\mathcal{V}$) model and a Memory ($\mathcal{M}$) model. In our architecture the V model is a VAE and the M model is an LSTM. The meta-world model aims to learn the dynamics shared among multiple worlds, so one M model is shared across the different world environments while each world keeps its own V model. The training loss consists of the reconstruction loss of the VAE and the prediction loss of the LSTM. Correspondingly, training is divided into a reconstruction phase $\mathcal{R}$ and a prediction phase $\mathcal{P}$. In the $\mathcal{R}$ phase, we train the VAE with the reconstruction loss. In the $\mathcal{P}$ phase, we train both the VAE and the LSTM with the prediction loss. We don’t train on both losses jointly because the conflict between them can drive the meta-world model into a poorly performing local minimum.

The two training phases can be understood as:

The $\mathcal{P}$ phase degrades the reconstruction ability of the $\mathcal{V}$ models so that they become compatible with the shared dynamics;
The $\mathcal{R}$ phase recovers the reconstruction ability of the $\mathcal{V}$ models.

The two phases alternate until they reach a balance.
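One way to write the two phases down is sketched below. The notation is assumed here for illustration and is not taken from the original experiments: $q_k$ and $p_k$ stand for the encoder and decoder of world $k$’s V model, $f_{\mathcal{M}}$ for the shared LSTM with hidden state $h_t$, and $\ell$ for a generic prediction loss.

$$
\begin{aligned}
\mathcal{R}\ \text{phase (per world } k\text{, updates } \mathcal{V}_k \text{ only):}\quad
& \mathcal{L}_{\mathcal{R}}^{(k)} = \mathbb{E}_{z \sim q_k(z \mid s)}\big[-\log p_k(s \mid z)\big] + \mathrm{KL}\big(q_k(z \mid s)\,\|\,p(z)\big) \\
\mathcal{P}\ \text{phase (updates every } \mathcal{V}_k \text{ and the shared } \mathcal{M}\text{):}\quad
& \mathcal{L}_{\mathcal{P}} = \sum_k \mathbb{E}\Big[\ell\big(f_{\mathcal{M}}(z_t^{(k)}, a_t, h_t),\ z_{t+1}^{(k)}\big)\Big], \qquad z_t^{(k)} \sim q_k\big(z \mid s_t^{(k)}\big)
\end{aligned}
$$

Under this reading, only $\mathcal{L}_{\mathcal{P}}$ couples the worlds, because it is the only term that passes through the shared $\mathcal{M}$ model.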

Our experiments are carried out on variants of the Atari Pong environment. The Pong frames are first preprocessed into binary images. We then build five variants of the environment, shown in Fig.1. $\Gamma_o$ is the original world and the five variants are $\Gamma_t$, $\Gamma_h$, $\Gamma_v$, $\Gamma_c$, $\Gamma_m$. $\Gamma_t$ is the transpose of the original world; $\Gamma_v$ divides $\Gamma_o$ into two parts vertically and swaps them, and $\Gamma_h$ is obtained in the same way but horizontally; $\Gamma_c$ changes the colour of $\Gamma_o$; $\Gamma_m$ is the mirror image of $\Gamma_o$. All of these worlds have different state spaces but the same action space, and the physical meaning of each action is the same.
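The five variants can be sketched as simple transforms of a binary frame. This is an illustrative sketch, not the original preprocessing code. The function names are made up, frames are plain nested lists, "vertically" and "horizontally" are read as the direction of the cut, and the colour change is assumed to be an inversion of the binary image.

```python
from typing import Callable, Dict, List

Frame = List[List[int]]  # binary image stored as rows of 0/1 pixels


def transpose(frame: Frame) -> Frame:
    """Gamma_t: swap rows and columns."""
    return [list(column) for column in zip(*frame)]


def swap_left_right_halves(frame: Frame) -> Frame:
    """Gamma_v (assumed reading): cut along a vertical line and swap the halves."""
    mid = len(frame[0]) // 2
    return [row[mid:] + row[:mid] for row in frame]


def swap_top_bottom_halves(frame: Frame) -> Frame:
    """Gamma_h (assumed reading): cut along a horizontal line and swap the halves."""
    mid = len(frame) // 2
    return [list(row) for row in frame[mid:] + frame[:mid]]


def invert_colour(frame: Frame) -> Frame:
    """Gamma_c (assumption): flip every binary pixel."""
    return [[1 - pixel for pixel in row] for row in frame]


def mirror(frame: Frame) -> Frame:
    """Gamma_m: reverse each row, as a mirror would."""
    return [row[::-1] for row in frame]


# Hypothetical lookup table from world name to its observation transform.
VARIANTS: Dict[str, Callable[[Frame], Frame]] = {
    "t": transpose,
    "v": swap_left_right_halves,
    "h": swap_top_bottom_halves,
    "c": invert_colour,
    "m": mirror,
}
```

Each transform changes only the observation, so the same action sequence produces corresponding trajectories in every world.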

Fig.1 - five world variants

This raises a question: how can we know that we have learned the shared dynamics, when the network has enough capacity to learn several dynamics at once? We validate it indirectly by checking whether the representations of corresponding states from different worlds are unified. If they are similar, the worlds share the same representations, and the dynamic changes between adjacent states are then also the same. Because of the variational inference in the VAE, latent vectors are sampled from distributions, and the KL divergence between two low-variance distributions can be very high even when their means are close: for two univariate Gaussians with equal variance $\sigma^2$ it is $(\mu_1-\mu_2)^2/(2\sigma^2)$, which grows without bound as $\sigma \to 0$. Thus, we choose not to compare the latent vectors $z$ of the VAEs directly. Instead, we take the $z$ encoded by one world’s V model, decode it with another world’s V model, and check whether the output is the corresponding state. If it is, we take it as a sign that the shared dynamics have been learned, and it also shows that the shared dynamics can unify the representations.
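The cross-decoding check can be sketched as follows. This is a minimal illustration, not the code behind the results. The encoders and decoders are passed in as plain callables, and `toy_encode` and `toy_decode_mirror` are made-up stand-ins for two trained V models. The pixel-agreement score is likewise an assumed metric.

```python
from typing import Callable, List, Sequence

Frame = List[List[int]]
Latent = List[float]


def pixel_agreement(a: Frame, b: Frame) -> float:
    """Fraction of pixels on which two equally sized binary frames agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def cross_world_score(
    encode_src: Callable[[Frame], Latent],
    decode_dst: Callable[[Latent], Frame],
    to_dst_world: Callable[[Frame], Frame],
    frames: Sequence[Frame],
) -> float:
    """Mean agreement between cross-decoded frames and their true counterparts."""
    scores = []
    for frame in frames:
        z = encode_src(frame)           # latent from the source world's V model
        predicted = decode_dst(z)       # decoded by the other world's V model
        expected = to_dst_world(frame)  # ground-truth corresponding state
        scores.append(pixel_agreement(predicted, expected))
    return sum(scores) / len(scores)


def toy_encode(frame: Frame) -> Latent:
    """Stand-in encoder for the original world: flatten the pixels."""
    return [float(pixel) for row in frame for pixel in row]


def toy_decode_mirror(z: Latent, width: int = 2) -> Frame:
    """Stand-in decoder for the mirror world: reshape, then flip each row."""
    rows = [z[i:i + width] for i in range(0, len(z), width)]
    return [[int(pixel) for pixel in row[::-1]] for row in rows]
```

With these stand-ins, `cross_world_score(toy_encode, toy_decode_mirror, lambda f: [r[::-1] for r in f], frames)` returns 1.0 for any list of two-pixel-wide binary frames, which is the situation the validation looks for.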

In most of our experiments, we train the meta-world model on $\Gamma_o$ and one world from $\{\Gamma_t, \Gamma_h, \Gamma_v, \Gamma_c, \Gamma_m\}$. The validation results are shown below.

Fig.2 - five world validation

Fig.3 - four worlds comparison ($\Gamma_v$ excluded)

In summary, all we do is let the agent look for one shared dynamics that associates itself with ‘itself’ in another world; it is then able to pass the mirror test. We also tried training multiple worlds together and improving the probability that such training succeeds. Along the way we found some interesting results, which may be added here later or presented in a new post.

PPO in Particle Environment

2018-05-17T00:00:00-07:00

As we know, reinforcement learning (RL) suffers from high variance and lacks the stable learning signals of supervised learning. To me the learning process resembles a seesaw: learning about new things can make performance on past experiences worse. I previously wrote a demo that learns to recite the digits of $\pi$, an infinite non-repeating decimal, sequentially; with a hidden size of 64 it could recite only up to 1700 digits. The recurrent neural network (RNN) predicts 10 digits at a time and is trained on those 10 digits at each time step. If the network predicts them correctly, we move on to the next 10 digits. Otherwise, after being trained on the labels, the network has a 90 percent chance of predicting the same digits again and a 10 percent chance of starting again from the beginning. At test time, the RNN recites $\pi$ from the beginning until it outputs a wrong result. Although the training procedure uses supervised learning, it is also a Markov decision process and has something in common with RL. When the network changes its parameters, the memory passed along by the RNN changes accordingly, but these changes in memory are not taken into account at the current time step and can affect the RNN’s predictions at test time. Although this kind of problem can be alleviated by offline learning and large-batch training, how to balance reinforcement learning data remains an open problem.
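The retry and restart rule of the demo can be sketched as a small scheduling function. This is an illustration of the rule described above, not the original demo code. The names `CHUNK`, `RETRY_PROB` and `next_position` are made up, and the RNN itself is left out.

```python
import random

CHUNK = 10        # digits predicted per step, as described in the post
RETRY_PROB = 0.9  # chance of retrying the same digits after a mistake


def next_position(position: int, correct: bool, rng: random.Random) -> int:
    """Return the index of the first digit of the next chunk to predict."""
    if correct:
        return position + CHUNK   # move on to the next 10 digits
    if rng.random() < RETRY_PROB:
        return position           # train on the labels, then try again
    return 0                      # restart reciting from the beginning
```

The restart branch is what makes the procedure revisit early digits with the updated parameters, so earlier chunks are re-tested after the network has changed.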

PPO serves as the default reinforcement learning algorithm at OpenAI. I’d like to interpret PPO in an intuitive way. The PPO clipped objective is

$L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$.

For further information about PPO, please refer to the paper. Here $r_t(\theta)=\pi_\theta(a_t|s_t)/\pi_{old}(a_t|s_t)$ is the probability ratio between the new and the old policy. Assume $\hat{A}_t$ is greater than zero. If we don’t clip the ratio $r_t(\theta)$, the optimisation may sacrifice performance on some states for better performance on others. However, those states are causally linked and can’t be treated like the independent data of supervised learning. For example, suppose we train on a mini-batch with a discount factor of one, so that ideally the advantages are all the same, and we take the $i$-th and $j$-th entries of the batch. Let $\pi_{old}(a_i|s_i) = 0.8$, $\pi_{old}(a_j|s_j) = 0.1$ and $\epsilon = 0.2$. If the update gives $\pi(a_i|s_i)=0.4$ and $\pi(a_j|s_j)=0.3$, then $r_i=0.5$ and $r_j=3$, so $r_i + r_j = 3.5$ compared with 2 before the update. The unclipped objective therefore improves over the two entries, yet the new policy can lead to a totally different trajectory. Thus, if we optimise them together in an off-line manner, part of the data may damage the learned policy by a significant amount. PPO limits this kind of tradeoff by clipping the ratio: here $r_j$ is clipped to 1.2 while $r_i$ stays at 0.5, so the gain from $r_j$ no longer overwhelms the loss from $r_i$.
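The two-entry example can be checked numerically. The sketch below assumes both advantages equal 1. The helper names are made up for illustration and do not come from any PPO library.

```python
def clip(x: float, low: float, high: float) -> float:
    """Clamp x to the interval [low, high]."""
    return max(low, min(x, high))


def surrogate(pi_new: float, pi_old: float, advantage: float, epsilon=None) -> float:
    """Per-sample surrogate objective; clipped PPO form when epsilon is given."""
    ratio = pi_new / pi_old
    if epsilon is None:
        return ratio * advantage
    clipped_ratio = clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


# (pi_old, pi_new) for entries i and j from the example in the text.
entries = [(0.8, 0.4), (0.1, 0.3)]
advantage = 1.0  # assumption: identical positive advantages

before = sum(surrogate(old, old, advantage) for old, _ in entries)
unclipped = sum(surrogate(new, old, advantage) for old, new in entries)
clipped = sum(surrogate(new, old, advantage, epsilon=0.2) for old, new in entries)
print(before, unclipped, clipped)
```

Up to floating-point rounding, the three sums are 2, 3.5 and 1.7. Without clipping, the update looks like an improvement over the old policy, while the clipped objective scores it below the starting value of 2 and so does not reward it.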

I found an interesting phenomenon when I ran an experiment on the multiple_particle_env platform. In this simple scenario, two particles each know their own goal and aim to reach it among three landmarks. Each particle has the same colour as its goal. At every time step, each particle’s reward is the negative distance between it and its goal.

Though DDPG has been shown to solve this kind of problem, we tried to solve it with traditional policy gradient architectures. I first implemented vanilla policy gradients, but found that, without communication, the two agents go to the same place, which is not any of the landmarks. On analysis, their final positions seem to be near the centre of the three landmarks, which is marked by a black circle. The agents can’t learn from the difference between their goals, because agents with different goals go to nearly the same place. The performance is shown below.

Fig.1 - Vanilla Policy Gradients

The agent can’t find better learning signals in the data, so it compromises by heading for the centre of the three landmarks to avoid more severe penalties. This looked like a bug in my program. However, when my colleague and I used PPO instead, the particles reached their goals exactly. The comparison provides indirect evidence that vanilla policy gradients may run into the tradeoff problem described above. The performance of PPO is shown below.
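A small calculation shows why the centre is a natural compromise. The sketch below rests on assumptions that are not part of the original experiment: the policy ignores its goal, the goal is equally likely to be any of the three landmarks, and the landmarks form an equilateral triangle with made-up coordinates.

```python
import math


def reward(position, goal):
    """Negative Euclidean distance between a particle and its goal."""
    return -math.hypot(position[0] - goal[0], position[1] - goal[1])


def mean_reward(position, landmarks):
    """Expected reward when the goal is equally likely to be any landmark."""
    return sum(reward(position, goal) for goal in landmarks) / len(landmarks)


# Hypothetical layout: an equilateral triangle with side length 2.
landmarks = [(0.0, 0.0), (2.0, 0.0), (1.0, math.sqrt(3.0))]
centre = (
    sum(x for x, _ in landmarks) / len(landmarks),
    sum(y for _, y in landmarks) / len(landmarks),
)

centre_score = mean_reward(centre, landmarks)
landmark_scores = [mean_reward(landmark, landmarks) for landmark in landmarks]
print(centre_score, landmark_scores)
```

In this layout, standing at the centre gives a better average reward than committing to any single landmark, so a policy that fails to use its goal gains nothing by leaving the centre.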

Fig.2 - PPO

Tracking like a Game

2018-03-03T00:00:00-08:00

I previously implemented a real-time pedestrian tracking system with my colleagues, which served as the basis of the MOT16 winner. Tracking was done by matching object features extracted from a CNN using ROI pooling. It works well but has some disadvantages. One is that the features may change when the target object changes shape, for example when a pedestrian bends down or turns in another direction. Another is that when two objects overlap, it’s easy to mistake one for the other afterwards.

In Atari games, we try to obtain rewards by controlling agents. Similarly, in tracking problems, we can treat moving a bounding box a fixed distance in one direction as an action. From this perspective, I implemented a tracking demo based on the REINFORCE algorithm. In this scenario, one large bounding box, which represents the receptive field, aims to keep the blue target rectangle around its centre. The information captured by the local receptive field helps the agent catch up with the target easily. The code is available in my DeepWhat repository.
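The game-like framing can be sketched with three small pieces: a discrete move, a reward for keeping the target centred, and the discounted returns that REINFORCE uses to weight log-probability gradients. This is not the code from the DeepWhat repository. The step size, the four-action set and the distance-based reward are assumptions made for illustration.

```python
import math

STEP = 1.0  # assumed fixed distance moved per action
ACTIONS = {
    "up": (0.0, -STEP),
    "down": (0.0, STEP),
    "left": (-STEP, 0.0),
    "right": (STEP, 0.0),
}


def move_box(box_centre, action):
    """Shift the bounding-box centre by one fixed step in the chosen direction."""
    dx, dy = ACTIONS[action]
    return (box_centre[0] + dx, box_centre[1] + dy)


def centring_reward(box_centre, target_centre):
    """Assumed reward: negative distance from the box centre to the target."""
    return -math.hypot(box_centre[0] - target_centre[0],
                       box_centre[1] - target_centre[1])


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns
```

In a REINFORCE update, each action’s log-probability gradient would be scaled by the matching entry of `discounted_returns`, so moves that bring the box closer to the target are reinforced.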

There are more possible extensions. For more complex fixed scenes, we can simulate the background environment to help the agents learn to distinguish background features. For multiple objects, recurrent neural networks are needed to retain information about each moving object, and they can help tell the objects apart after they have overlapped.