### 技术控

今日:96| 主题:52592

# [其他] Gradient Descent Learns Linear Dynamical Systems

245 6

### 立即注册CoLaBug.com会员，免费获得投稿人的专业资料，享用更多功能，玩转个人品牌！

x
From text translation to video captioning, learning to map one sequence to another is an increasingly active research area in machine learning. Fueled by the success of recurrent neural networks in its many variants, the field has seen rapid advances over the last few years. Recurrent neural networks are typically trained using some form of stochastic gradient descent combined with backpropagation for computing derivatives. The fact that gradient descent finds a useful set of parameters is by no means obvious. The training objective is typically non-convex. The fact that the model is allowed to maintain state is an additional obstacle that makes training of recurrent neural networks challenging.
In this post, we take a step back to reflect on the mathematics of recurrent neural networks. Interpreting recurrent neural networks as dynamical systems, we will show that stochastic gradient descent successfully learns the parameters of an unknown linear dynamical system even though the training objective is non-convex. Along the way, we’ll discuss several useful concepts from control theory, a field that has studied linear dynamical systems for decades. Investigating stochastic gradient descent for learning linear dynamical systems not only bears out interesting connections between machine learning and control theory, it might also provide a useful stepping stone for a deeper undestanding of recurrent neural networks more broadly.
Linear dynamical systems

We focus on time-invariant single-input single-output system. For an input sequence of real numbers $x_1,\dots, x_T\in \mathbb{R}$, the system maintains a sequence of hidden states $h_1,\dots, h_T\in \mathbb{R}^n$, and produces a sequence of outputs $y_1,\dots, y_T\in \mathbb{R}$ according to the following rules:
Here $A,B,C,D$ are linear transformations with compatible dimensions, and $\xi_t$ is Gaussian noise added to the output at each time. In the learning problem, often called system identification in control theory, we observe samples of input-output pairs $((x_1,\dots, x_T),(y_1,\dots y_T))$ and aim to recover the parameters of the underlying linear system.
Although control theory provides a rich set of techniques for identifying and manipulating linear systems, maximum likelihood estimation with stochastic gradient descent remains a popular heuristic.
We denote by $\Theta = (A,B,C,D)$ the parameters of the true system. We parametrize our model with $\widehat{\Theta} = (\hat{A},\hat{B},\hat{C},\hat{D})$, and the trained model maintains hidden states $\hat{h}_t$ and outputs $\hat{y}_t$ exactly as in equation (1). For each given example $(x,y) = ((x_1,\dots,x_T), (y_1,\dots, y_t))$, the log-likelihood of model $\widehat{\Theta}$ is
. The population risk is defined as the expected log-likelihood,
Stochastic gradients of the population risk can be computed in time $O(Tn)$ via back-propagation given random samples. We can therefore directly minimize population risk using stochastic gradient descent. The question is just whether the algorithm actually converges. Even though the state transformations are linear, the objective function we defined is not convex. Luckily, we will see that the objective is still close enough to convex for stochastic gradient to make steady progress towards the global minimum.
Hair dryers and quasi-convex functions

Before we go into the math, let’s illustrate the algorithm with a pressing example that we all run into every morning: hair drying. Imagine you have a hair dryer with a low temperature setting and a high temperature setting. Neither setting is ideal. So every morning you switch between the settings frantically in an attempt to modulate to the ideal temperature. Measuring the resulting temperature (red line below) as a function of the input setting (green dots below), the picture you’ll see is something like this :

You can see that the output temperature is related to the inputs. If you set the temperature to high for long enough, you’ll eventually get a high output temperature. But the system has state. Briefly lowering the temperature has little effect on the outputs. Intuition suggests that these kind of effects should be captured by a system with two or three hidden states. So, let’s see how SGD would go about finding the parameters of the system. We’ll initialize a system with three hidden states such that before training its predictions are just the inputs of the system. We then run SGD with a fixed learning rate on the same sequence for 400 steps.

Evidently, gradient descent converges just fine on this example. Let’s look at the hair dryer objective function along the line segment between two random points in the domain.

1234下一页

dakuba 发表于 2016-10-14 14:49:17
 小白一个 顶一下

aulla 发表于 2016-10-14 17:31:50
 过去的事情可以不忘记，但一定要放下。

 夏天就是不好，穷的时候我连西北风都没得喝……

tong000 发表于 2016-11-17 08:10:52
 将薪比薪想一下，算了，不想活了。

 梦想不能实现，都是因为它不够现实。

 夏天就是不好，穷的时候我连西北风都没得喝……

• ## Microsoft Teams in Office 365

I hope everyone would agree to the fact [...]

• ## SEO排名密码 很多人都犯错了

笔者是学软件开发出身的，毕业之后经过三四年的 [...]

• ## iOS 抓取 HTML ,CSS XPath 解析数据

以前我们获取数据的方式都是使用 AFN 来 Get JSO [...]

• ## 再度思考马云提出的“新零售”，该如何布局

阿里巴巴集团将于2月20日公布其研究多时的“ [...]

• ## [图]流媒体抗议“禁穆令”活动上周被关停

据外媒报道，上周，纽约皇后区移动影像博物馆(Th [...]

© 2001-2017 Comsenz Inc. Design: Dean. DiscuzFans.