Language Identification Of Individual Words With Joint Sequence Models


Within a multilingual automatic speech recognition (ASR) system, knowledge of the language of origin of unknown words can improve pronunciation modelling accuracy. This is of particular importance for ASR systems required to deal with code-switched speech or proper names of foreign origin. For words that occur in the language model, but do not occur in the pronunciation lexicon, text-based language identification (T-LID) of a single word in isolation may be required. This is a challenging task, especially for short words.
We motivate the importance of accurate T-LID in speech processing systems and introduce a novel way of applying Joint Sequence Models to the T-LID task. We obtain ...

For minority languages, reliable word lists of substantial size can be surprisingly difficult to obtain, and text-based language identification (T-LID) systems must be able to generalise from fairly small corpora to be useful in practice. Much of the research in the T-LID field has been performed on running text (see Botha and Barnard \cite{botha2012factors} for an overview), but several studies have focused on identifying the language of origin of short text samples. Good results have been obtained using conventional statistical methods, such as n-gram based Support Vector Machine (SVM) classification~\cite{giwan, bhargava2010language} or n-gram based Na\"{i}ve Bayes (NB) classification~\cite{giwan}.

In the above work, text sequences as long as 15 characters are still considered ``short''. When words are considered in isolation, the task becomes extremely challenging, with shorter words (3 or 4 characters) retaining very little language-discriminative information. SVMs still perform well in this task domain~\cite{giwan}, but they can be time-consuming to train, especially when fairly large training corpora are used and classification across multiple languages is required. Specific studies on words in isolation applied Na\"{i}ve Bayes~\cite{giwan}, SVMs~\cite{giwan, bhargava2010language} and compression techniques~\cite{hategan2009language}.
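To make the n-gram based Na\"{i}ve Bayes approach concrete, the following is a minimal sketch of a character n-gram classifier applied to words in isolation. It is not the implementation used in the cited studies: the boundary padding with `#`, the add-one smoothing, the class name `NgramNaiveBayes` and the tiny word list are all assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' boundary markers."""
    padded = "#" * (n - 1) + word.lower() + "#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.word_counts = Counter()              # language -> number of training words
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, labelled_words):
        for word, lang in labelled_words:
            grams = char_ngrams(word, self.n)
            self.ngram_counts[lang].update(grams)
            self.word_counts[lang] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, word):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        total_words = sum(self.word_counts.values())
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        scores = {}
        for lang, counts in self.ngram_counts.items():
            denom = sum(counts.values()) + v
            score = math.log(self.word_counts[lang] / total_words)
            for gram in char_ngrams(word, self.n):
                score += math.log((counts[gram] + 1) / denom)
            scores[lang] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


if __name__ == "__main__":
    # Hypothetical toy word list; far too small for a real system.
    toy_data = [
        ("house", "English"), ("school", "English"),
        ("huis", "Afrikaans"), ("skool", "Afrikaans"),
        ("umuntu", "isiZulu"), ("abantu", "isiZulu"),
        ("motho", "Sesotho"), ("batho", "Sesotho"),
    ]
    model = NgramNaiveBayes(n=3).fit(toy_data)
    print(model.predict("bantu"))
```

A word of three or four characters yields only a handful of n-grams, so each score is a sum of very few terms. This is one way to see why short words carry so little language-discriminative information.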

At the same time, grapheme-to-phoneme (G2P) conversion techniques, aimed at predicting the pronunciation of a word from its orthographic form, have matured significantly during the past decade. Specifically, Joint Sequence Models (JSMs) have become a well-utilised method for pronunciation prediction~\cite{bisani2008joint}. (See~\cite{bisani2008joint, damper1998comparison, taylor2005hidden} for comprehensive reviews of G2P techniques.)

In this work we investigate the applicability of JSMs to the Language Identification (LID) task, and analyse the comparative performance that can be obtained when applying JSMs rather than the better-known SVM classifiers (which we use as our baseline classifier). Specifically, we consider a four-language South African task (Afrikaans, English, isiZulu and Sesotho) and a data set for which semi-comparable baselines are available. We describe how JSMs can be applied to the LID task and demonstrate factors that influence the identification accuracy.

Joint Sequence Models (JSMs)~\cite{bisani2008joint} are based on the concept of ``graphones''. Each graphone consists of a sequence of graphemes linked to a sequence of phonemes modelled as a single unit. Both the graphone inventory and m-th order conditional...
