Hyper Networks

In this post, I will talk about our recent paper, HyperNetworks (arXiv:1609.09106). I worked on this paper as a Google Brain Resident - a great research program where we can work on machine learning research for a whole year, with a salary and benefits! The Brain team is now accepting applications for the 2017 program: see g.co/brainresidency .
  Introduction

Figure: A Dynamic Hypernetwork generating handwriting. The weight matrices of the LSTM are changing over time.
Most modern neural network architectures are either a deep ConvNet, or a long RNN, or some combination of the two. These two architectures seem to be at opposite ends of a spectrum. A Recurrent Network can be viewed as a really deep feed-forward network with identical weights at each layer (this is called weight-tying). A deep ConvNet allows each layer to be different. But perhaps the two are related somehow. Every year, the winning ImageNet models get deeper and deeper. Think of the deep 110-layer, or even 1001-layer, Residual Network architectures we keep hearing about. Do all 110 layers have to be unique? Are most layers even useful?
People have already thought of forcing a deep ConvNet to be like an RNN, i.e. with identical weights at every layer. However, if we force a deep ResNet to have its weights tied, its performance would be embarrassing. In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each of its layers. We can use a hypernetwork to explore the tradeoff between the model's expressivity and how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
While neural net compression is a nice topic, I am more interested in doing things that go a bit against the grain, as you know from reading my previous blog posts. There are many algorithms that take a fully trained network and then apply compression methods to the weights of the pre-trained network so that it can be stored with fewer bits. While these approaches are useful, I find it much more interesting to start from a small number of parameters, and learn to construct larger, more complex representations from them. Many beautiful, complex structures in the natural world can be constructed from a small set of simple rules. Digital artists have also designed beautiful generative works based on this concept. It is this kind of complexity-from-abstraction that I want to explore in my machine learning research. In my view, neural network decompression is a more interesting problem than compression. In particular, I also want to explore decompressing the weights of an already compressed neural net, i.e., the weights of a recurrent network.
The more exciting work is in the second part of the paper, where we apply Hypernetworks to Recurrent Networks. The weights of an RNN are tied at each time-step, limiting its expressivity. What if we had a way to allow the weights of an RNN to be different at each time step (like a deep ConvNet), and also for each individual input sequence?
The main result of the paper is to challenge the weight-sharing paradigm of Recurrent Nets. We do this by embedding a small Hypernetwork inside a large LSTM, so that the weights used by the main LSTM can be modified by the Hypernetwork whenever it feels like modifying them. In this instance, the Hypernetwork is also an LSTM, just a smaller one, and we give it the power to modify the weights of the main LSTM at each timestep, and for each input sequence. In the process, we achieve state-of-the-art results on character-level language modelling tasks for the Penn Treebank and Wikipedia datasets. More importantly, we explore what happens when our models are able to generate generative models.
   The resulting HyperLSTM model looks and feels like a normal generic TensorFlow RNN cell. Just like how some people among us have super-human powers, in the HyperLSTM model, if the main LSTM is a human brain, then the HyperLSTM is some weird intelligent creature controlling the brain from within.
     

Figure: I, for one, welcome our new Arquillian overlords.

Background

The concept of using a neural network to generate the weights of a much larger neural network originated in Neuroevolution. While genetic algorithms are easy and fun to use, it is difficult to get them to directly find solutions for a really large set of model parameters. Ken Stanley came up with a brilliant method called HyperNEAT to address this problem. He came up with this method while trying to use an algorithm called NEAT to create beautiful neural-network-generated art. NEAT is an evolved neural network that can be used to generate art: you give it the location of each pixel and read the colour of that pixel from its output. What if we get NEAT to paint the weights of a weight matrix instead? HyperNEAT attempts to do this. It uses a small NEAT network to generate all the weight parameters of a large neural network. The small NEAT network usually consists of fewer than a few thousand parameters, and its architecture is evolved to produce the weights of a large network, given a set of virtual coordinate information about each weight connection.
While the concept of parameterizing a large set of weights with a small number of parameters is indeed very useful in Neuroevolution, some other researchers thought that HyperNEAT can be a bit of an overkill. Writing and debugging NEAT can be a lot of work, and selective laziness can go a long way in research. Schmidhuber's group decided to try an alternative method, and just used the Discrete Cosine Transform to compress a large weight matrix so that it can be approximated by a small set of coefficients (this is how JPEG compression works). They then used genetic algorithms to solve for the best set of coefficients, so that the weights of a recurrent network are good enough to drive a virtual car around the tracks inside the TORCS simulation. They basically performed JPEG compression on the weight matrix of a self-driving car.
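To make the compression idea concrete, here is a toy sketch (not the actual TORCS setup; the matrix size and number of kept coefficients are made up, and it assumes SciPy is available) of approximating a weight matrix with a 2D DCT and reconstructing it from a handful of low-frequency coefficients:

import numpy as np
from scipy.fftpack import dct, idct

def dct2(x):
  return dct(dct(x, norm='ortho', axis=0), norm='ortho', axis=1)

def idct2(x):
  return idct(idct(x, norm='ortho', axis=0), norm='ortho', axis=1)

W = np.random.randn(64, 64)      # stand-in for a recurrent weight matrix
C = dct2(W)                      # transform to frequency space
C_small = np.zeros_like(C)
C_small[:8, :8] = C[:8, :8]      # keep only 64 low-frequency coefficients
W_approx = idct2(C_small)        # the network would then be run with W_approx

In the evolutionary setting, it is those few coefficients, rather than the thousands of raw weights, that get searched over.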
Some people, including my colleagues at DeepMind, have also played around with the idea of using HyperNEAT to evolve a small weight-generating network, but using backpropagation instead of genetic algorithms to solve for the weights of the small network. They summarized some cool results in their paper about DPPNs.
Personally, I'm more interested in exploring another aspect of neural network weight generation. I like to view a neural net as a powerful type of computing device, and the weights of the neural net are kind of like the machine-level instructions for this computer. The larger the number of neurons, the more expressive the neural net becomes. But unlike the machine instructions of a normal computer, neural network weights are also robust: if we add noise to some part of the network, it can still function somewhat. For this reason I find the concept of expressing a large set of weights with a small number of parameters fascinating. This form of weight-abstraction is sort of like coming up with a beautiful abstract higher-level language (like LISP or Python) that gets compiled into raw machine-level code (the neural network weights). Or, from a biological viewpoint, it is like how large, complex biological structures can be expressed at the genotype level.
  Static Hypernetworks

In the paper, I decided to explore the concept of having a network generate the weights of a larger network, and to develop this concept a bit further. However, I took a slightly different approach from HyperNEAT and DPPNs. As mentioned above, these methods feed a set of virtual input coordinates into a small network to generate the weights of a large network. I played around with this concept extensively (see Appendix Section A of the paper), but it just didn't work well for modern architectures like deep ConvNets and LSTMs. While HyperNEAT and DCT can produce good looking weight matrices, due to the prior enforced by smooth functions such as sinusoids, this artistic smoothness is also their limitation for many practical applications. A good looking weight matrix is not useful if it doesn't work. Look at this picture of the weights of a typical Deep Residual Network:
   

Figure: Images of 16x16x3x3 and 32x32x3x3 weight kernels in a typical Residual Network trained on CIFAR-10.
While I wouldn't want to hang pictures of ResNet weights on my living room wall, they work really well, and I want to generate pictures of these weights with fewer parameters. I took an approach that is simpler and more in the fashion of VAE or GAN-type approaches. Modern generative models like GANs and VAEs take in a smallish embedding vector Z of, say, 64 numbers, and from these 64 values try to generate realistic images of cats or other cool things. Why not also try to generate weight matrices for a ResNet? So the approach we take is to train a simple 2-layer network to generate the 16x16x3x3 weight kernels from an embedding vector of 64 numbers. The larger weight kernels will just be constructed by tiling small versions together (i.e., the one on the right will require 256 numbers to generate). We will use the same 2-layer network to generate each and every kernel of a deep ResNet. When we train the ResNet to do image classification, rather than training the ResNet weights directly, we will be training the set of Z's and the parameters of this 2-layer network instead.
   

Figure: Images of generated 16x16x3x3 and 32x32x3x3 weight kernels for a ResNet trained on the same task.
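Below is a minimal NumPy sketch of the static hypernetwork just described: a shared two-layer network maps a 64-number embedding z to one 16x16x3x3 kernel, and a 32x32x3x3 kernel is tiled from four such blocks (4 x 64 = 256 numbers). The hidden layer size, the ReLU nonlinearity and the random initialisation are my own placeholder choices, not necessarily the exact ones from the paper:

import numpy as np

rng = np.random.RandomState(0)
z_dim, hidden = 64, 128                          # hidden size is a made-up choice

# the shared hypernetwork parameters: these, plus one z per generated kernel,
# are what get trained instead of the kernels themselves
W1, b1 = rng.randn(z_dim, hidden) * 0.01, np.zeros(hidden)
W2, b2 = rng.randn(hidden, 16 * 16 * 3 * 3) * 0.01, np.zeros(16 * 16 * 3 * 3)

def generate_kernel(z):
  h = np.maximum(0.0, z @ W1 + b1)               # layer 1 (ReLU, an assumption)
  return (h @ W2 + b2).reshape(16, 16, 3, 3)     # layer 2 -> one 16x16x3x3 block

k16 = generate_kernel(rng.randn(z_dim))          # one kernel from 64 numbers

# a 32x32x3x3 kernel is tiled from four blocks, i.e. 4 * 64 = 256 numbers
zs = rng.randn(2, 2, z_dim)
rows = [np.concatenate([generate_kernel(zs[i, j]) for j in range(2)], axis=1)
        for i in range(2)]
k32 = np.concatenate(rows, axis=0)               # shape (32, 32, 3, 3)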
We experimented with a typical off-the-shelf ResNet (the "WRN-40-2" configuration from this nice ResNet variation called Wide Residual Networks), of which 36 layers are these types of weight kernels. The best test classification accuracy on CIFAR-10 at the time of writing is ~ 96%, using tens of millions of parameters. This particular ResNet uses only ~ 2.2 million parameters and can be trained to get ~ 94% accuracy on CIFAR-10, which I think is quite good. Our version of this ResNet, which uses hypernet-generated weights, uses merely ~ 150k parameters while the accuracy is still respectable: our model got ~ 93% test accuracy.
   

Figure: Structure of the Wide ResNet family. We used N=10 and k=1.
These results make me think about those super deep 1001-layer ResNets that perform really well in ImageNet contests. Perhaps most of their weights are not individually that useful; having the weights there, as a kind of placeholder so that a large number of neurons can compute, may be the useful bit of why they are so good.
  Dynamic Hypernetworks

As mentioned in the Introduction, we also tried to apply Hypernetworks to Recurrent Networks, and I feel this is the main contribution of the research. One of the insights from working with hypernetworks on ResNets is that while we get to use far fewer parameters in the model, we see a reduction in accuracy as a tradeoff. So what if we go the other way instead? If we can use a hypernetwork to relax the weight-sharing constraints of an RNN, and allow the weight matrix to change at each unrolled timestep, the RNN would look more like a deep ConvNet, so maybe we can squeeze better results out of it.
   Figure: The HyperRNN system. The black system represents the main RNN while the orange system represents the weight-generating HyperRNN cell.
  Our approach is to put a small LSTM cell (called the HyperLSTM cell) inside a large LSTM cell (the main LSTM). The HyperLSTM cell will have its own hidden units and its own input sequence. The input sequence for the HyperLSTM cell will be constructed from 2 sources: the previous hidden states of the main LSTM concatenated with the actual input sequence of the main LSTM. The outputs of the HyperLSTM cell will be the embedding vector Z that will then be used to generate the weight matrix for the main LSTM.
Unlike the Static Hypernetwork, the weight-generating embedding vectors are not kept constant, but are dynamically generated by the HyperLSTM cell. This allows our model to generate a new set of weights at each time step and for each input example. In the paper I discuss many practicalities, and a more computation- and memory-efficient way of generating the weights from the embedding vector, to simplify the method and reduce its computational cost. One thing that I learned is that while it is important to dream up new types of algorithms and new approaches in research, at the end of the day it is important to keep things practical and make stuff work. It is also important to make it easy for other people to use your work.
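As a rough illustration of that memory-efficient scheme, here is a tiny NumPy sketch (toy sizes, my own variable names) of the core trick: the hyper cell's output is projected down to a small embedding z, z is mapped to a per-row scaling vector d, and the main LSTM's contribution is computed as the elementwise product d(z) * (W h) rather than materialising a full weight matrix W(z) at every step. The real version of this is the hyper_norm function in the code listing at the end of this post:

import numpy as np

rng = np.random.RandomState(0)
n_main, n_hyper, z_dim = 8, 4, 2        # toy sizes
W = rng.randn(n_main, n_main) * 0.1     # a fixed weight matrix of the main LSTM
W_hz = rng.randn(z_dim, n_hyper) * 0.1  # hyper output -> small embedding z
W_zd = rng.randn(n_main, z_dim) * 0.1   # embedding z -> per-row scaling d(z)

def modulated_matmul(h_main, h_hyper):
  z = W_hz @ h_hyper                    # recomputed by the hyper cell every step
  d = W_zd @ z                          # one scale per row of W
  return d * (W @ h_main)               # ~ W(z) h, without ever forming W(z)

out = modulated_matmul(rng.randn(n_main), rng.randn(n_hyper))  # one gate's contribution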
For our implementation of Dynamic Hypernetworks, we made it so that we can just plug our HyperLSTM cell into any TensorFlow code written to use tf.nn.rnn_cell objects, since the HyperLSTM inherits from this abstract class. This makes it easy to plug my research code into existing code that was designed to use the vanilla LSTM cell. For example, when I was experimenting with our HyperLSTM cell on the Wikipedia dataset, I just used char-rnn-tensorflow and plugged the research model right in for training and inference. Here is a passage that char-rnn-tensorflow generated with our HyperLSTM model after training on the Wikipedia enwik8 dataset:
   Figure: Generated text, along with levels of weight-changing activity of the main LSTM’s weight matrices. Somehow HyperLSTM learned to put Soviet, Facism, a computer company, and an all-important type of machine in one sentence.
   In the above figure, in addition to just displaying the generated text, we can also visualise how the weights of the main LSTM are being modified by the HyperLSTM cell. I chose to visualise the changes of the four hidden-to-gate weight matrices of the LSTM over time in four different colours, to represent each of the four input, candidate, forget and output gates of the LSTM (see this blog post for a great explanation). We can interpret high intensity regions as instances where the HyperLSTM cell just made large changes to the weights of the main LSTM, before the main LSTM is used to generate each character. A low intensity means the HyperLSTM cell is taking a break, so the weights of the main LSTM are not being changed that much during these breaks. Below is another example passage:
   
An interesting thing to note is that during the less active periods of the HyperLSTM cell, the types of words generated seem to be more predictable. For example, in the first example, Microsoft Windows was generated by a more or less static network after Micros. In the second example, elections in the early 1980s was generated by a relatively constant main LSTM, but right after the 1980s, the HyperLSTM cell suddenly woke up and decided to give the main LSTM a bit of a shake, before it went on to discuss savage employment concerns. In a way, the HyperLSTM cell is generating the generative model as the generative model is generating the sequence.
   This meta-ability to dynamically generate the generative model seems to be very powerful, and in fact our HyperLSTM model was able to beat previous state-of-the-art character-level prediction dataset benchmarks such as Character-Level Penn Treebank and Hutter Prize Wikipedia ( enwik8 ). Our model got 1.25 bpc and 1.38 bpc respectively (as of 27-Sep-2016) without using dynamic evaluation , beating previous records of 1.27 and 1.40 (as of 10-Sep-2016).
Someone else will probably beat these state-of-the-art numbers in a few weeks, given the fast pace of the machine learning research field, and the fact that ICLR 2017 deadlines are just around the corner. In fact, I don't really think beating the state-of-the-art on some text dataset is as important as exploring the concept of this multi-level dynamic model within a dynamic model abstraction. I think in the future, people might focus less on architecture design, and the focus will move in two directions: either towards the application side, or towards the fundamental building-block side. The thing I like about our approach is that we effectively created a building block called the HyperLSTM, which from the TensorFlow user's point of view looks and feels exactly like a normal LSTM cell. It is just as easy to plug-and-play the HyperLSTM into existing TensorFlow code as it is to change between RNN, GRU, and LSTM cells, since we made HyperLSTM an instance of tf.nn.rnn_cell.RNNCell, called HyperLSTMCell (which contains the full system, not to be confused with the HyperLSTM cell inside it).
  Generating Generative Models

I also experimented with HyperLSTM on the handwriting generation task. In a previous post, I explored Alex Graves' approach of getting an LSTM to generate a random handwriting sequence. The way this generation works is to model the x and y coordinates of the pen stroke as a 2D mixture of Gaussians distribution, along with a binary Bernoulli random variable to model the probability that the pen stays on the paper.
   

Figure: Handwriting sampled from a 2D mixture Gaussian distribution and the Bernoulli distribution, using the vanilla LSTM model. Both the Gaussian and Bernoulli probability distributions change over time.
During handwriting, the parameters of these two distributions change over time, and also depend on each other. For example, as you finish writing a word, the probability that your pen leaves the paper increases, and the next location of the pen will likely be further away from where it is now and have a much higher variance. We can get an LSTM to output the parameters for both the mixture of Gaussians and the Bernoulli distribution, and have the values of these parameters change at each timestep depending on the LSTM's internal states. You can visualize how the Gaussian distribution changes over time by looking at the red bubbles in the above figure, which indicate the location and size of the Gaussian distributions for the next pen stroke. We can sample from this time-varying distribution and hence sample an entire fake handwriting passage by connecting the sampled points. I view this type of model as similar to Dynamic Hypernetworks, since an LSTM is dynamically changing the parameters of a generative distribution (the Gaussian and the Bernoulli) over time, and by training the entire model on a handwriting dataset, it can do a formidable job of generating handwriting samples.
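For concreteness, here is a small NumPy sketch of sampling one pen move from such per-timestep outputs; the parameter names are my own stand-ins for the LSTM's outputs (mixture weights, bivariate Gaussian parameters, and the pen-up probability):

import numpy as np

def sample_pen_step(pi, mu1, mu2, s1, s2, rho, p_pen_up, rng=np.random):
  # pi: mixture weights [K]; mu1/mu2, s1/s2: per-component means and std devs;
  # rho: per-component correlations; p_pen_up: Bernoulli prob. that the pen lifts.
  k = rng.choice(len(pi), p=pi)                            # pick a mixture component
  cov = [[s1[k] ** 2, rho[k] * s1[k] * s2[k]],
         [rho[k] * s1[k] * s2[k], s2[k] ** 2]]
  dx, dy = rng.multivariate_normal([mu1[k], mu2[k]], cov)  # sample the pen offset
  pen_up = rng.uniform() < p_pen_up                        # sample the pen-up event
  return dx, dy, pen_up

At each timestep the LSTM emits a fresh set of these parameters, so repeatedly calling this and connecting the sampled points traces out a handwriting sequence.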
   In our paper, we applied Dynamic Hypernetworks to extend this approach. We will replace BasicLSTMCell in my code with HyperLSTMCell . In this approach, the weight matrices and biases of the LSTM will be modified over time by a smaller LSTM. By doing this simple change, we extend this model-generating-model approach by another level, by having a small LSTM dynamically generate a larger LSTM model at each time step, and for the generated large LSTM model to generate parameters for the Gaussian and Bernoulli distributions also at each time step. So model-generating-model becomes model-generating-model-generating-model.
Similar to the text generation earlier, this newer approach achieves much better scores compared to normal, and even multi-layer, LSTMs with a similar number of training parameters. I modified write-rnn-tensorflow to replicate the exact experiment done in section 4.2 of Graves' paper, and checked that the Log Loss results when using BasicLSTMCell are close enough to the previously published results. After tying out to these historical results, we can switch BasicLSTMCell to HyperLSTMCell and rerun the experiments. But before doing that, we tried to improve our baseline method first. It is important to show the baseline techniques some respect, and give face to them, since they led the way to our research. We found that by applying black magic techniques like data augmentation and dropout, we can already improve the score for the baseline 1-layer BasicLSTMCell from -1026 nats to -1055 nats. After switching to our model, the HyperLSTMCell got the Log Loss score all the way down to -1162 nats, a large and meaningful improvement. Along with the quantitative results, there are various generated samples from the different models that you can check out in the paper (in the Appendix section).
To wrap up this post, I made a small demo showing the handwriting generation process with HyperLSTMs. I want to show how the weights of the main LSTM are being modified by the HyperLSTM cell with this demo. Unlike character generation, the time axis doesn't correspond exactly to the x-axis for handwriting, so I found it easier to visualise this process as a web demo, since you can't really do animations in .pdf papers submitted to arXiv. In the future, science will move more towards web posts rather than static .pdf files for journals and conferences. My colleagues @ch402 and @shancarter also recently created a platform called distill.pub to encourage this more modern web-based publication format. I will start using this platform in the future.
You can see the handwriting being generated, as well as the changes being made to the main LSTM's four hidden-to-gate weight matrices. I chose to visualize only the changes made to these four hidden-to-gate matrices (one each for the input, candidate, forget and output gates) of the main LSTM, in four different colours, although in principle the four input-to-gate matrices and all the biases could be visualized as well. Higher intensity means the HyperLSTM cell is making a larger modification to the weights of the main LSTM.
  Local Implementation

   I learned a lot from reading open-source code and tutorials, and for practical Recurrent Networks in TensorFlow I highly recommend Denny Britz’s blog , the mysterious r2rt super-blog, this blog post on Recurrent Batch Norm , and also TensorFlow With The Latest Papers Implemented .
This local implementation of HyperLSTMCell is based on the Layer Norm implementation by LeavesBreathe and the Batch Norm code by OlavHN. You can also try to plug HyperLSTMCell into char-rnn-tensorflow, or use it in other interesting tasks. It works, but it is currently not as fast as a vanilla LSTM; over time I expect to see many improvements in core TensorFlow that allow for more speedy optimisations.
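Here is a minimal usage sketch, assuming the TF 0.x-era APIs used in the listing at the end of this post; the placeholder sizes and input pipeline are made up, and a real char-rnn would add an embedding lookup and a softmax layer on top:

import tensorflow as tf

batch_size, seq_len, vocab_size = 128, 100, 50   # stand-in sizes
cell = HyperLSTMCell(num_units=1000,
                     use_recurrent_dropout=True, dropout_keep_prob=0.90,
                     hyper_num_units=128, hyper_embedding_size=4)

inputs = tf.placeholder(tf.float32, [batch_size, seq_len, vocab_size])
initial_state = tf.zeros([batch_size, cell.state_size])
# tf.nn.rnn (TF 0.x) expects a Python list of [batch, dim] tensors, one per timestep
inputs_list = [tf.squeeze(t, [1]) for t in tf.split(1, seq_len, inputs)]
outputs, final_state = tf.nn.rnn(cell, inputs_list, initial_state=initial_state)
# `outputs` is a list of [batch, 1000] tensors to feed into your output layer

The cell's constructor arguments here match the Character PTB setup described in the next paragraph.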
I tried to redo the Character Penn Treebank experiment on a local desktop equipped with a Titan X GPU and a clean open-source development stack. I used the same setup as the paper - a hidden unit size of 1000 for the main LSTM, a batch size of 128, a sequence length of 100, a learning rate of 0.001, and a dropout keep probability of 0.90 for training the model on the training set. The hypernetwork has 128 units and generates embedding vectors of size 4 (these two parameters are optional arguments of HyperLSTMCell). I trained the model for two days and recorded results for the validation set at intervals of 500 steps. For evaluation on the validation set, I used a batch size of 20 and a sequence length of 2000.
   

Figure: TensorBoard results for the Character PTB validation set (BPC versus training step).
The best model on the validation set occurred after training for 37k steps (after a day of training) and achieved a validation score of 1.282 bpc. Using this model on the test set, on a single batch of the full sequence length of 449945 (no mini-batches, so this takes a while in CPU mode), the result on the test set was 1.249256 bpc, slightly better than the published state-of-the-art results in the paper. The model only sees the test set once, unlike methods that use dynamic evaluation, where some people decide to train their models on the test set, and report results on the test set (note: this is not cool). I tested HyperLSTMCell on TensorFlow 0.90. Later on, I'll try to release the patches I made for char-rnn-tensorflow that do the train/validation/test split properly, once I clean them up a bit. This should make it easier for others to conduct research on character-level language models in TensorFlow, and to beat our results in the future.
  Sample Generated Char PTB Text

The model generates text that looks like the samples below. Can you tell the difference between the generated text and the actual dataset? Maybe we can still tell the difference.
   
  1. end of the third quarter of compared with mr. guber says
  2. he succeeds mr. peters last week to complete the daily <unk> rule now which would n't prefer to prevent a european company to continue its <unk> and initial publications history co. produces about cash to lock in subs questions
  3. he specializes in <unk> line notes will be <unk> <unk> as powerful as the sybbel in europe
  4. china <unk> put itself up on galileo in imports according-to the industry loss means that neither mr. guzman cabrera can have a common deal e of his portrayal as a great rule on the wall street journal 's <unk> page owners
  5. mutual funds are translated into release over the next three soptimes home capital stocks and their murals
  6. eath decade to their own interests moody 's figures <unk> said soviet securities greens james <unk> putting into the streets ahead of the 1953s
  7. but the editoria says also sold six members on a multibillion-dollar write-down in next year
  8. on sept. N a N bid also had backed a record for citizens feel this decade
  9. for example agreements tend to manage th from bankruptcy-law proceedings are reflected in aggressively and authorized loans on sept. N said further research capital holding corp. declined to comment on the defeat
  10. for the third quarter western union had o N billion mark based on investment <unk> N N while the company 's interests in N because the company 's total earnings were increased because of increasing commitment from a subsidiary in N can get a paised takeover
  11. the items have been focusing on unprofitable budgets are attached to the bank of england
  12. futures prices helped last friday because cell scores surged in july does the problem would have to be the same
  13. and the <unk> stock woite high-priced goods were higher inflation and the new environmentalists here say around a view of strong high-risk junk bonds at painewebber analysts and electronics concern operates vice president said the new line of participation in belgium 's long-term sales of most machinery at fax <unk> seem useful
  14. <unk> taipei and other systems recreated the remaining huge <unk> equipment casts <unk> firms to introduce its truck
  15. the dozen then growth stocks is mixed in particular flags
  16. consider the cathay 's interests slipped N N in the latest quarter but acquired most <unk> regulations earlier this month that include a somewhat premium to be replaced by the <unk> & co. spinoff consent walter <unk> <unk> a vice president and bank affiliate of private challenges that a fuston railroad will continue as a new former brewer <unk> an adequate assistant to vesoera <unk> who <unk> <unk> spirits linked about one of the oueriaged cells in the bay bridge as well
  17. as long as it is <unk> by william <unk> who is <unk> towers <unk> fishing <unk> and <unk> are in a <unk> of how <unk> <unk> in <unk> only five <unk> oil mind
  18. now when it was only part of the research and service that are culling up the losses as if the fight curb
  19. <unk> owner of <unk> spielvogel approached line-item veto before he prosecuted a collective session with several theater
  20. justice department crime <unk> would take her note to <unk> secret authority over all the pregnancy mr. bors will soon have an opportunity to succeed
  21. mca advanced N N to N N N dallas investors are left by <unk> turner europe and kidder ginning agreements to buy more revenue
  22. so he said however sales improved in environmental protection feel slightly from # N million in the year-ago period
  23. after raising both the purchase warrants might have alive losses in corporate tax has been achieved
  24. but mr. krenz wanted to pirsue the effort without exercise with six blocks after it was distance in the second quarter as well as taxes and for accelerated funds for u.s. operations in its top stocks
  25. <unk> <unk> had a portion of the tax cut in defense spending on the operations destroyed by northeast
  26. foreign bonds holding inc. and manville said reports of stocks in the bonds or canceled hong kong ambrica ind. in <unk> <unk> mass
  27. meanwhile industrialized louns are sold at and on the company buck <unk>
  28. the pacific stock exchange closed at $ N including mining after options clearing told defend the names of jaguar managed banks
  29. instead of the judge <unk> raises report to the eastern bill
  30. dr. levy prefers teachers to <unk> from inflation that are literally driven to all the <unk> says <unk> <unk> managing director and tax foud <unk> lware or <unk> activity in the works
  31. earlier this year that does n't reflect shearson 's international was the nation 's capital into expuration
  32. in the year-earlier quarter for each white money market unveiled research and development systems is facing a N N u.s. premium over the market 's traditional very budget pilot showed demand for painewebber inc. and b.a.t industries plc of palo alto and globe
  33. america 's election democrats and <unk> fraud is the request of president expanded shopping economic interests but the <unk> hit during seven see
    HyperLSTMCell

   Updated versions will be at https://github.com/hardmaru/supercell/
   
# HyperLSTM
# 27-Sep-2016
# https://arxiv.org/abs/1609.09106
#
# latest at https://github.com/hardmaru/supercell
#
# derived with help from
# https://github.com/OlavHN/bnlstm
# https://github.com/LeavesBreathe/tensorflow_with_latest_papers
import tensorflow as tf
import numpy as np
# Orthogonal Initializer from
# https://github.com/OlavHN/bnlstm
def orthogonal(shape):
  flat_shape = (shape[0], np.prod(shape[1:]))
  a = np.random.normal(0.0, 1.0, flat_shape)
  u, _, v = np.linalg.svd(a, full_matrices=False)
  q = u if u.shape == flat_shape else v
  return q.reshape(shape)
def orthogonal_initializer(scale=1.0):
  def _initializer(shape, dtype=tf.float32):
    return tf.constant(orthogonal(shape)*scale, dtype)
  return _initializer
class LSTMCell(tf.nn.rnn_cell.RNNCell):
  def __init__(self, num_units, forget_bias=1.0,
    use_recurrent_dropout=False, dropout_keep_prob=0.9):
    self.num_units = num_units
    self.forget_bias=forget_bias
    self.use_recurrent_dropout=use_recurrent_dropout
    self.dropout_keep_prob=dropout_keep_prob
  @property
  def state_size(self):
    return 2 * self.num_units
  @property
  def output_size(self):
    return self.num_units
  def __call__(self, x, state, scope=None):
    with tf.variable_scope(scope or type(self).__name__):
      c, h = tf.split(1, 2, state)
      h_size = self.num_units
      x_size = x.get_shape().as_list()[1]
      w_init=orthogonal_initializer(1.0)
      #w_init=tf.constant_initializer(0.0)
      #w_init=tf.random_normal_initializer(stddev=0.01)
      #w_init=None # uniform
      h_init=orthogonal_initializer(1.0)
      #h_init=tf.constant_initializer(0.0)
      #h_init=tf.random_normal_initializer(stddev=0.01)
      #h_init=None # uniform
      W_xh = tf.get_variable('W_xh',
        [x_size, 4 * self.num_units], initializer=w_init)
      W_hh = tf.get_variable('W_hh',
        [self.num_units, 4 * self.num_units], initializer=h_init)
      bias = tf.get_variable('bias',
        [4 * self.num_units], initializer=tf.constant_initializer(0.0))
      concat = tf.concat(1, [x, h])
      W_full = tf.concat(0, [W_xh, W_hh])
      hidden = tf.matmul(concat, W_full) + bias
      i, j, f, o = tf.split(1, 4, hidden)
      if self.use_recurrent_dropout:
        g = tf.nn.dropout(tf.tanh(j), self.dropout_keep_prob)
      else:
        g = tf.tanh(j)
      new_c = c*tf.sigmoid(f+self.forget_bias) + tf.sigmoid(i)*g
      new_h = tf.tanh(new_c) * tf.sigmoid(o)
      return new_h, tf.concat(1, [new_c, new_h]) # fuk tuples.
# support functions for layer norm
def moments_for_layer_norm(x, axes=1, name=None):
  #output for mean and variance should be [batch_size]
  # from https://github.com/LeavesBreathe/tensorflow_with_latest_papers
  epsilon = 1e-3 # found this works best.
  if not isinstance(axes, list): axes = list(axes)
  with tf.op_scope([x, axes], name, "moments"):
    mean = tf.reduce_mean(x, axes, keep_dims=True)
    variance = tf.sqrt(tf.reduce_mean(tf.square(x-mean), axes, keep_dims=True)+epsilon)
    return mean, variance
def layer_norm(input_tensor, scope="layer_norm", alpha_start=1.0, bias_start=0.0):
  # derived from:
  # https://github.com/LeavesBreathe/tensorflow_with_latest_papers, but simplified.
  with tf.variable_scope(scope):
    input_tensor_shape_list = input_tensor.get_shape().as_list()
    num_units = input_tensor_shape_list[1]
    alpha = tf.get_variable('layer_norm_alpha', [num_units],
      initializer=tf.constant_initializer(alpha_start))
    bias = tf.get_variable('layer_norm_bias', [num_units],
      initializer=tf.constant_initializer(bias_start))
    mean, variance = moments_for_layer_norm(input_tensor,
      axes=[1], name = "moments_"+scope)
    output = (alpha * (input_tensor-mean))/(variance)+bias
  return output
def super_linear(x, output_size, scope=None, reuse=False,
  init_w="ortho", weight_start=0.0, use_bias=True, bias_start=0.0):
  # support function doing linear operation.  uses ortho initializer defined earlier.
  shape = x.get_shape().as_list()
  with tf.variable_scope(scope or "linear"):
    if reuse == True:
      tf.get_variable_scope().reuse_variables()
    w_init = None # uniform
    x_size = shape[1]
    h_size = output_size
    if init_w == "zeros":
      w_init=tf.constant_initializer(0.0)
    elif init_w == "constant":
      w_init=tf.constant_initializer(weight_start)
    elif init_w == "gaussian":
      w_init=tf.random_normal_initializer(stddev=weight_start)
    elif init_w == "ortho":
      w_init=orthogonal_initializer(1.0)
    w = tf.get_variable("super_linear_w",
      [shape[1], output_size], tf.float32, initializer=w_init)
    if use_bias:
      b = tf.get_variable("super_linear_b", [output_size], tf.float32,
        initializer=tf.constant_initializer(bias_start))
      return tf.matmul(x, w) + b
    return tf.matmul(x, w)
class LayerNormLSTMCell(tf.nn.rnn_cell.RNNCell):
  def __init__(self, num_units, forget_bias=1.0,
    use_recurrent_dropout=False, dropout_keep_prob=0.90):
    self.num_units = num_units
    self.forget_bias = forget_bias
    self.use_recurrent_dropout = use_recurrent_dropout
    self.dropout_keep_prob = dropout_keep_prob
  @property
  def input_size(self):
    return self.num_units
  @property
  def output_size(self):
    return self.num_units
  @property
  def state_size(self):
    return 2 * self.num_units
  def __call__(self, x, state, timestep = 0, scope=None):
    with tf.variable_scope(scope or type(self).__name__):  # "BasicLSTMCell"
      h, c = tf.split(1, 2, state)
      h_size = self.num_units
      x_size = x.get_shape().as_list()[1]
      w_init=orthogonal_initializer(1.0)
      #w_init=tf.constant_initializer(0.0)
      #w_init=tf.random_normal_initializer(stddev=0.01)
      #w_init=None # uniform
      h_init=orthogonal_initializer(1.0)
      #h_init=tf.constant_initializer(0.0)
      #h_init=tf.random_normal_initializer(stddev=0.01)
      #h_init=None # uniform
      W_xh = tf.get_variable('W_xh',
        [x_size, 4 * self.num_units], initializer=w_init)
      W_hh = tf.get_variable('W_hh',
        [self.num_units, 4 * self.num_units], initializer=h_init)
      # no bias, since there's a bias thing inside layer norm
      # and we don't wanna double task variables.
      concat = tf.concat(1, [x, h]) # concat for speed.
      W_full = tf.concat(0, [W_xh, W_hh])
      concat = tf.matmul(concat, W_full) #+ bias # live life without garbage.
      # i = input_gate, j = new_input, f = forget_gate, o = output_gate
      i, j, f, o = tf.split(1, 4, concat)
      i = layer_norm(i, 'ln_i')
      j = layer_norm(j, 'ln_j')
      f = layer_norm(f, 'ln_f')
      o = layer_norm(o, 'ln_o')
      if self.use_recurrent_dropout:
        g = tf.nn.dropout(tf.tanh(j), self.dropout_keep_prob)
      else:
        g = tf.tanh(j)
      new_c = c*tf.sigmoid(f+self.forget_bias) + tf.sigmoid(i)*g
      new_h = tf.tanh(layer_norm(new_c, 'ln_c')) * tf.sigmoid(o)
    return new_h, tf.concat(1, [new_h, new_c])
class HyperLSTMCell(tf.nn.rnn_cell.RNNCell):
  # HyperLSTM, with Ortho Initialization,
  # Layer Norm and Recurrent Dropout without Memory Loss.
  def __init__(self, num_units, forget_bias=1.0,
    use_recurrent_dropout=False, dropout_keep_prob=0.90, use_layer_norm=True,
    hyper_num_units=128, hyper_embedding_size=4, hyper_use_recurrent_dropout=False):
    self.num_units = num_units
    self.forget_bias = forget_bias
    self.use_recurrent_dropout = use_recurrent_dropout
    self.dropout_keep_prob = dropout_keep_prob
    self.use_layer_norm = use_layer_norm
    self.hyper_num_units = hyper_num_units
    self.hyper_embedding_size = hyper_embedding_size
    self.hyper_use_recurrent_dropout = hyper_use_recurrent_dropout
    if self.use_layer_norm:
      cell_fn = LayerNormLSTMCell
    else:
      cell_fn = LSTMCell
    self.hyper_cell = cell_fn(hyper_num_units,
      use_recurrent_dropout=hyper_use_recurrent_dropout,
      dropout_keep_prob=dropout_keep_prob)
  @property
  def input_size(self):
    return self.num_units
  @property
  def output_size(self):
    return self.num_units
  @property
  def state_size(self):
    return 2 * self.num_units
  def layer_norm(self, layer, scope="layer_norm"):
    # wrapper for layer_norm
    if self.use_layer_norm:
      return layer_norm(layer, scope)
    else:
      return layer
  def hyper_norm(self, layer, scope="hyper", use_bias=True):
    num_units = self.num_units
    embedding_size = self.hyper_embedding_size
    # recurrent batch norm init trick (https://arxiv.org/abs/1603.09025).
    init_gamma = 0.10 # cooijmans' da man.
    with tf.variable_scope(scope):
      zw = super_linear(self.hyper_output, embedding_size, init_w="constant",
        weight_start=0.00, use_bias=True, bias_start=1.0, scope="zw")
      alpha = super_linear(zw, num_units, init_w="constant",
        weight_start=init_gamma / embedding_size, use_bias=False, scope="alpha")
      result = tf.mul(alpha, layer)
      if use_bias:
        zb = super_linear(self.hyper_output, embedding_size, init_w="gaussian",
          weight_start=0.01, use_bias=False, bias_start=0.0, scope="zb")
        beta = super_linear(zb, num_units, init_w="constant",
          weight_start=0.00, use_bias=False, scope="beta")
        result = result + beta
    return result
  def __call__(self, x, state, timestep = 0, scope=None):
    with tf.variable_scope(scope or type(self).__name__):
      h, c = tf.split(1, 2, state)
      w_init=orthogonal_initializer(1.0)
      #w_init=tf.constant_initializer(0.0)
      #w_init=tf.random_normal_initializer(stddev=0.01)
      #w_init=None # uniform
      h_init=orthogonal_initializer(1.0)
      #h_init=tf.constant_initializer(0.0)
      #h_init=tf.random_normal_initializer(stddev=0.01)
      #h_init=None # uniform
      h_size = self.num_units
      x_size = x.get_shape().as_list()[1]
      batch_size = x.get_shape().as_list()[0]
      self.hyper_state = tf.zeros([batch_size, self.hyper_cell.num_units*2])
      # concatenate the input and hidden states for hyperlstm input
      hyper_input = tf.concat(1, [x, h])
      hyper_output, hyper_new_state = self.hyper_cell(hyper_input, self.hyper_state)
      self.hyper_output = hyper_output
      self.hyper_state = hyper_new_state
      W_xh = tf.get_variable('W_xh',
        [x_size, 4*self.num_units], initializer=w_init)
      W_hh = tf.get_variable('W_hh',
        [self.num_units, 4*self.num_units], initializer=h_init)
      bias = tf.get_variable('bias',
        [4*self.num_units], initializer=tf.constant_initializer(0.0))
      xh = tf.matmul(x, W_xh)
      hh = tf.matmul(h, W_hh)
      # split Wxh contributions
      ix, jx, fx, ox = tf.split(1, 4, xh)
      ix = self.hyper_norm(ix, 'hyper_ix', use_bias=False)
      jx = self.hyper_norm(jx, 'hyper_jx', use_bias=False)
      fx = self.hyper_norm(fx, 'hyper_fx', use_bias=False)
      ox = self.hyper_norm(ox, 'hyper_ox', use_bias=False)
      # split Whh contributions
      ih, jh, fh, oh = tf.split(1, 4, hh)
      ih = self.hyper_norm(ih, 'hyper_ih', use_bias=True)
      jh = self.hyper_norm(jh, 'hyper_jh', use_bias=True)
      fh = self.hyper_norm(fh, 'hyper_fh', use_bias=True)
      oh = self.hyper_norm(oh, 'hyper_oh', use_bias=True)
      # split bias
      ib, jb, fb, ob = tf.split(0, 4, bias) # bias is to be broadcasted.
      # i = input_gate, j = new_input, f = forget_gate, o = output_gate
      i = ix + ih + ib
      j = jx + jh + jb
      f = fx + fh + fb
      o = ox + oh + ob
      i = self.layer_norm(i, 'ln_i')
      j = self.layer_norm(j, 'ln_j')
      f = self.layer_norm(f, 'ln_f')
      o = self.layer_norm(o, 'ln_o')
      if self.use_recurrent_dropout:
        g = tf.nn.dropout(tf.tanh(j), self.dropout_keep_prob)
      else:
        g = tf.tanh(j)
      new_c = c*tf.sigmoid(f+self.forget_bias) + tf.sigmoid(i)*g
      new_h = tf.tanh(self.layer_norm(new_c, 'ln_c')) * tf.sigmoid(o)
    return new_h, tf.concat(1, [new_h, new_c])



