Language Identification Of Individual Words With Joint Sequence Models

Within a multilingual automatic speech recognition (ASR) system, knowledge of the language of origin of unknown words can improve pronunciation modelling accuracy. This is of particular importance for ASR systems required to deal with code-switched speech or proper names of foreign origin. For words that occur in the language model, but do not occur in the pronunciation lexicon, text-based language identification (T-LID) of a single word in isolation may be required. This is a challenging task, especially for short words.
We motivate the importance of accurate T-LID in speech processing systems
and introduce a novel way of applying Joint Sequence Models to the T-LID task.
We obtain ...

For minority languages, reliable word lists of substantial size can be surprisingly difficult to obtain, and text-based language identification (T-LID) systems must be able to generalise from fairly small corpora to be useful in practice. Much of the research in the T-LID field has been performed on running text (see Botha and Barnard~\cite{botha2012factors} for an overview), but several studies have focused on identifying the language of origin of short text samples. Good results have been obtained using conventional statistical methods, such as n-gram based Support Vector Machine (SVM) classification~\cite{giwan, bhargava2010language} or n-gram based Na\"{i}ve Bayes (NB) classification~\cite{giwan}.

In the above work, text sequences as long as 15 characters are still considered ``short''. When words are considered in isolation, the task becomes extremely challenging, with shorter words (3 or 4 characters) retaining very little language-discriminative information. SVMs still perform well in this task domain~\cite{giwan}, but they can be time-consuming to train, especially when fairly large training corpora are used and classification across multiple languages is required. Specific studies on words in isolation have applied Na\"{i}ve Bayes~\cite{giwan}, SVMs~\cite{giwan, bhargava2010language} and compression techniques~\cite{hategan2009language}.
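
To make the difficulty concrete, the n-gram based NB classifier mentioned above can be sketched as follows. The notation ($w$, $g_k$, $K$, $\mathcal{L}$) is introduced here purely for illustration and is an assumption, not necessarily the exact formulation used in the cited studies. A word $w$ is decomposed into its $K$ overlapping character n-grams $g_1, \ldots, g_K$ and, under the usual conditional independence assumption, the most probable language is selected from the candidate set $\mathcal{L}$:
\begin{equation}
\hat{\ell}(w) = \arg\max_{\ell \in \mathcal{L}} \left[ \log P(\ell) + \sum_{k=1}^{K} \log P(g_k \mid \ell) \right]
\end{equation}
If we assume that a single boundary symbol is added on either side of the word, a three-character word yields only $K = 3$ trigrams, so the decision rests on very few terms.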

At the same time, grapheme-to-phoneme (G2P) conversion techniques, aimed at predicting the pronunciation of a word from its orthographic form, have matured significantly during the past decade. Specifically, Joint Sequence Models (JSMs) have become a widely used method for pronunciation prediction~\cite{bisani2008joint}. (See~\cite{bisani2008joint, damper1998comparison, taylor2005hidden} for comprehensive reviews of G2P techniques.)

In this work we investigate the applicability of JSMs to the Language Identification (LID) task, and analyse the comparative performance that can be obtained when applying JSMs rather than the better-known SVM classifiers (which we use as our baseline classifier). Specifically, we consider a four-language South African task (Afrikaans, English, isiZulu and Sesotho) and a data set for which semi-comparable baselines are available. We describe how JSMs can be applied to the LID task, and demonstrate factors that influence the identification accuracy.

Joint Sequence Models (JSMs)~\cite{bisani2008joint} are based on the concept of ``graphones''. Each graphone consists of a sequence of graphemes linked to a sequence of phonemes, modelled as a single unit. Both the graphone inventory and m-th order conditional...

