An HMM-Based Pre-Training Approach for Sequential Data

Much recent research has highlighted the critical role of unsupervised pre-training in improving the performance of neural network models. However, extending those architectures to the temporal domain introduces additional issues, which often make it difficult to obtain good performance in a reasonable time. We propose a novel approach to pre-training sequential neural networks, in which a simpler, approximate distribution generated by a linear model is first used to drive the weights into a better region of the parameter space. After this smooth distribution has been learned, the network is fine-tuned on the more complex, real dataset. The benefits of the proposed method are demonstrated on a prediction task using two datasets of polyphonic music, and the general validity of the strategy is shown by applying it to two different recurrent neural network architectures.
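The first stage of the scheme summarized above, generating a smooth approximate dataset from a simpler model, can be illustrated with a minimal sketch. It assumes that the simpler model is a hidden Markov model (as the title suggests) with independent Bernoulli emissions, so that each time step is a binary vector such as a piano-roll frame. The function names `sample_hmm` and `make_pretraining_set`, and the parameters `pi`, `A` and `B`, are hypothetical and chosen only for illustration; in practice the HMM parameters would first be estimated from the real sequences.

```python
import numpy as np


def sample_hmm(pi, A, B, length, rng):
    """Sample one binary sequence from an HMM with Bernoulli emissions.

    pi: (K,) initial state probabilities.
    A:  (K, K) transition matrix, each row sums to 1.
    B:  (K, D) probability that each of the D output units is active
        in each hidden state (assumed emission model, for illustration).
    """
    K, D = B.shape
    seq = np.zeros((length, D), dtype=np.int8)
    state = rng.choice(K, p=pi)
    for t in range(length):
        # Emit a binary frame from the current state, then move on.
        seq[t] = rng.random(D) < B[state]
        state = rng.choice(K, p=A[state])
    return seq


def make_pretraining_set(pi, A, B, n_sequences, length, seed=0):
    """Stack n_sequences HMM samples into an array of shape (N, T, D)."""
    rng = np.random.default_rng(seed)
    return np.stack(
        [sample_hmm(pi, A, B, length, rng) for _ in range(n_sequences)]
    )
```

Under these assumptions, a recurrent network would first be trained on the output of `make_pretraining_set` and then fine-tuned on the real sequences, starting from the weights obtained in the first stage.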
Even though deep learning systems reach state-of-the-art performance in several machine learning tasks, their computational complexity is still a limitation in many real-world scenarios. This issue has been partially tackled with the advent of new high-performance parallel computing architectures, which exploit powerful graphics processors to speed up learning algorithms [1]. However, the breakthrough that made it possible to train large-scale networks effectively was the introduction of an unsupervised pre-training phase [2], in which the network is trained to build a good generative model of the data, which can subsequently be refined using a supervised criterion (fine-tuning phase). Pre-training initializes the weights of the network in a region where optimization is in some sense easier, thus helping the fine-tuning phase to reach better local optima. It might also perform some form of regularization, by introducing a bias towards good configurations of the parameter space [3]. Although the benefits of pre-training have been extensively investigated in the static domain (e.g., learning images encoded as fixed-size vectors), it is not yet clear how this approach should be extended to the temporal domain, where the aim is to model sequences of events. Dealing with time poses many challenges, because temporal dependencies limit the parallelization of the learning process. Despite recent advances in training recurrent networks [4], improving their convergence speed therefore remains challenging. A possible solution is to pre-train only the input-to-hidden connections, thereby ignoring temporal information (encoded by the hidden-to-hidden connections) by considering each element of the sequence as independent from the others [5].
In this paper we propose a different pre-training strategy, which is reminiscent of the idea of curriculum learning [6]. The rationale behind this approach is that complex problems should be learned by starting from simpler concepts and then gradually increasing the difficulty by showing more complex training examples to the learning agent. To this end, instead of using the...

