请选择 进入手机版 | 继续访问电脑版

技术控

    今日:152| 主题:53946
收藏本版 (1)
最新软件应用技术尽在掌握

[其他] Gradient Descent Learns Linear Dynamical Systems

[复制链接]
蜜彩儿 发表于 2016-10-14 13:20:46
261 6

立即注册CoLaBug.com会员,免费获得投稿人的专业资料,享用更多功能,玩转个人品牌!

您需要 登录 才可以下载或查看,没有帐号?立即注册

x
From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in its many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.
   In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown linear dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone for a deeper undestanding of recurrent neural networks more broadly.
  Linear dynamical systems

  We focus on time-invariant single-input single-output system. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:
  Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots y_T))$ and aim to recover the parameters of the underlying linear system.
  Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.
  We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For each given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_t))$, the log-likelihood of model $\widehat{\Theta}$ is
    . The population risk is defined as the expected log-likelihood,
   Stochastic gradients of the population risk can be computed in time $O(Tn)$ via back-propagation given random samples. We can therefore directly minimize population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still close enough to convex for stochastic gradient to make steady progress towards the global minimum.
  Hair dryers and quasi-convex functions

   Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a low temperature setting and a high temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like this :
  

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

  You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state. Briefly lowering the temperature has little effect on the outputs. Intuition suggests that these kind of effects should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.
  

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

   The blue line shows the predictions of SGD after  0 /400  gradient updates. Click to advance.
  Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.
  

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

   The function is clearly not convex, but it doesn’t look too bad either. In particular, from the picture, it could be that the objective function is quasi-convex :
      Definition:For $\tau > 0$, a function $f(\theta)$ is $\tau$-quasi-convex with respect to a global minimum $\theta ^ * $ if for every $\theta$,
       Intuitively, quasi-convexity states that the descent direction $-\nabla f(\theta)$ is positively correlated with the ideal moving direction $\theta^* -\theta$. This implies that the potential function $\left|\theta-\theta ^ * \right|^2$ decreases in expectation at each step of stochastic gradient descent. This observation plugs nicely into the standard SGD analysis, leading to the following result:
   Proposition:(informal) Suppose the population risk $f(\theta)$ is $\tau$-quasi-convex, then stochastic gradient descent (with fresh samples at each iteration and proper learning rate) converges to a point $\theta_K$ in $K$ iterations with error bounded by $ f(\theta_K) - f(\theta^*) \leq O(1/(\tau \sqrt{K}))$.
  The key challenge for us is to understand under what conditions we can prove that the population risk objective is in fact quasi-convex. This requires some background.
  Control theory, polynomial roots, and Pac-Man

   A linear dynamical system $(A,B,C,D)$ is equivalent to the system $(TAT^{-1}, TB, CT^{-1}, D)$ for any invertible matrix $T$ in terms of the behavior of the outputs. A little thought shows therefore that in its unrestricted parameterization the objective function cannot have a unique optimum. A common way of removing this redundancy is to impose a canonical form. Almost all non-degenerate system admit the controllable canonical form , defined as
   We will also parametrize our training model using these forms. One of its nice properties is that the coefficients of the characteristic polynomial of the state transition matrix $A$ can be read off from the last row of $A$. That is,
     Even in controllable canonical form, it still seems rather difficult to learn arbitrary linear dynamical systems. A natural restriction would be stability , that is, to require that the eigenvalues of $A$ are all bounded by $1.$ Equivalently, the roots of the characteristic polynomial should all be contained in the complex unit disc. Without stability, the state of the system could blow up exponentially making robust learning difficult. But the set of all stable systems forms a non-convex domain. It seems daunting to guarantee that stochastic gradient descent would converge from an arbtirary starting point in this domain without ever leaving the domain.
  We will therefore impose a stronger restriction on the roots of the characteristic polynomial. We call this the Pac-Man condition. You can think of it as a strengthening of stability.
      Pac-Man condition: A linear dynamical system in controllable canonical form satisfies the Pac-Man condition if the coefficient vector $a$ defining the state transition matrix satisfies
    for all complex numbers $z$ of modulus $|z| = 1$, where $q_a(z) = p_a(z)/z^n = 1+a_1z^{-1}+\dots + a_nz^{-n}$.   

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

  Above, we illustrate this condition for a degree 4 system plotting the value of $q_a(z)$ on complex plane for all complex numbers $z$ on the unit circle.
  We note that Pac-Man condition is satisfied by vectors $a$ with $|a|_1\le \sqrt{2}/2$. Moreover, if $a$ is a random Gaussian vector with expected $\ell_2$ norm bounded by $o(1/\sqrt{\log n})$, then it will satisfy Pac-Man condition with probability $1-o(1)$. Roughly speaking, the assumption requires the roots of the characteristic polynomial $p_a(z)$ are relatively dispersed inside the unit circle.
  The Pac-Man condition has three important implications:
  
       
  •   It implies via Rouche’s theorem that the spectral radius of A is smaller than 1 and therefore ensures stability of the system.
       
  • The vectors satisfying it form a convex set in $\mathbb{R}^n$.
       
  •   Finally, it ensures that the objective function is quasi-convex
      
  Main result

  Relying on the Pac-Man condition, we can show:
      Main theorem (Hardt, Ma, Recht, 2016): Under the Pac-Man condition, projected gradient descent algorithm, given $N$ sample sequences of length $T$, returns parameters $\widehat{\Theta}$ with population risk
       The theorem sorts out the right dependence on $N$ and $T$. Even if there is only one sequence, we can learn the system provided that the sequence is long enough. Similarly, even if sequences are really short, we can learn provided that there are enough sequences.
  Quasi-convexity in the frequency domain

  To establish quasi-convexity under the Pac-Man condition, we will first develop an explicit formula for the population risk in frequency domain. In doing so, we assume that $x_1,\dots, x_T$ are pairwise independent with mean 0 and variance 1. We also consider the population risk as $T\rightarrow \infty$ for simplicity in this post.
  A simple algebraic manipulation simplifies the population risk with infinite sequence length to
   The first term, $(\hat D - D)^2$ is convex and appears nowhere else. We can safely ignore it and focus on the remaining expression instead, which we call the idealized risk :
  To deal with the sequence $\hat{C}\hat{A}^kB$, we take its Fourier transform and obtain that
  Similarly we take the Fourier transform of $CA^kB$, denoted by $G_{\lambda}$. Then by Parseval’s Theorem, we obtain the following alternative representation of the population risk,
  Mapping out $G_\lambda$ and $\widehat G_\lambda$ for all $\lambda\in [0, 2\pi]$ gives the following picture:
  

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems

   Left: Target transfer function $G$. Right: Approximation $\widehat G$ at step 0 /10. Click to advance.
  Given this pretty representation of the idealized risk objective, we can finally prove our main lemma.
   Lemma:Suppose $\Theta$ satisfies the Pac-Man condition. Then, for every $0\le \lambda\le 2\pi$, $|G_{\lambda}-\widehat{G}_{\lambda}|^2$, as a function of $\hat{A},\hat{C}$ is quasi-convex in the Pac-Man region.
  The lemma reduces to the following simple claim.
   Claim:The function $h(\hat{u},\hat{v}) = |\hat{u}/\hat{v} - u/v|^2$ is quasi-convex in the region where $Re(\hat{v}/v) > 0$.
  The proof simply involves computing the gradients and checking the conditions for quasi-convexity by elementary algebra. We omit a formal proof, but intead show a plot of the function $h(\hat{u}, \hat{v}) = (\hat{u}/\hat{v}- 1)^2$ over the reals:
     

Gradient Descent Learns Linear Dynamical Systems

Gradient Descent Learns Linear Dynamical Systems
   Click to rotate.
    To see how the lemma follows from the previous claim we note that quasi-convexity is preserved under composition with any linear transformation. Specifically, $h(z)$ is quasi-convex, then $h(R x)$ is also quasi-convex for any linear map $R$. So, consider the linear map:
  With this linear transformation, our simple claim about a bivariate function extends to show that $(G_{\lambda}-\widehat{G}_{\lambda})^2$ is quasi-convex when $Re(\hat{v}/v) \ge 0$. In particular, when $\hat{a}$ and $a$ both satisfy the Pac-Man condition, then $\hat{v}$ and $v$ both reside in the 90 degree wedge. Therefore they have an angle smaller than 90 degree. This implies that $Re(\hat{v}/v) > 0$.
  Conclusion

   We saw conditions under which stochastic gradient descent successfully learns a linear dynamical system. In our paper , we further show that allowing our learned system to have more parameters than the target system makes the problem dramatically easier. In particular, at the expense of slight over-parameterization we can weaken the Pac-Man condition to a mild separation condition on the roots of the characteristic polynomial. This is consistent with empirical observations both in machine learning and control theory that highlight the effectiveness of additional model parameters.
  More broadly, we hope that our techniques will be a first stepping stone toward a better theoretical understanding of recurrent neural networks.



上一篇:Program development by stepwise refinement
下一篇:Flexible C memory allocation scheme with leak checking
dakuba 发表于 2016-10-14 14:49:17
小白一个 顶一下
回复 支持 反对

使用道具 举报

aulla 发表于 2016-10-14 17:31:50
过去的事情可以不忘记,但一定要放下。
回复 支持 反对

使用道具 举报

时光待我如初 发表于 2016-10-15 05:50:05
夏天就是不好,穷的时候我连西北风都没得喝……
回复 支持 反对

使用道具 举报

tong000 发表于 2016-11-17 08:10:52
将薪比薪想一下,算了,不想活了。
回复 支持 反对

使用道具 举报

董洋 发表于 2016-11-17 18:51:18
梦想不能实现,都是因为它不够现实。
回复 支持 反对

使用道具 举报

刘玉冰 发表于 2016-11-20 22:28:13
夏天就是不好,穷的时候我连西北风都没得喝……
回复 支持 反对

使用道具 举报

*滑动验证:
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

我要投稿

推荐阅读

扫码访问 @iTTTTT瑞翔 的微博
回页顶回复上一篇下一篇回列表
手机版/CoLaBug.com ( 粤ICP备05003221号 | 文网文[2010]257号 )

© 2001-2017 Comsenz Inc. Design: Dean. DiscuzFans.

返回顶部 返回列表