Preprocessing is an attempt to improve text classification by removing worthless information. In our work, document preprocessing involves removing punctuation marks, numbers, and words written in another language, and normalizing the documents by replacing the letters (أ, إ, آ) with (ا), the letters (ء, ؤ) with (ا), and the letter (ى) with (ا). Finally, stop words are removed; these are words that can be found in any text, such as prepositions and pronouns. The remaining words are retained and are referred to as keywords or features. The number of these features is usually large for large documents, and therefore some filtering can be ...
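The preprocessing steps described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in our work: the stop-word sample, the regular expression used to drop non-Arabic characters, and the function names `normalize` and `preprocess` are all assumptions made for the example. The letter replacements follow the list given in the text.

```python
import re

# Letter replacements as listed in the text.
NORMALIZATION_MAP = str.maketrans({
    "أ": "ا", "إ": "ا", "آ": "ا",
    "ء": "ا", "ؤ": "ا",
    "ى": "ا",
})

# Hypothetical stop-word sample (assumption; not the list used in the paper).
RAW_STOP_WORDS = ["في", "من", "على", "هو", "إلى"]

# Assumption: everything outside the basic Arabic letter/diacritic range and
# whitespace (punctuation, digits, Latin-script words) is treated as noise.
NON_ARABIC = re.compile(r"[^\u0621-\u0652\s]")


def normalize(text):
    """Apply the letter normalization to a string."""
    return text.translate(NORMALIZATION_MAP)


# Stop words are normalized too, so they match the normalized tokens.
STOP_WORDS = {normalize(word) for word in RAW_STOP_WORDS}


def preprocess(text):
    """Return the remaining tokens (features) of one document."""
    text = NON_ARABIC.sub(" ", text)       # drop punctuation, numbers, foreign words
    tokens = normalize(text).split()       # normalize letters, then tokenize
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("ذهب أحمد إلى المدرسة في 2020, school!"))
```

Normalizing the stop-word list with the same function is a design choice of this sketch: because normalization runs before stop-word removal, a stop word such as (إلى) would otherwise no longer match its normalized form.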
In our work we used Term Frequency (TF) as the feature selection method, where $TF(t) = \sum_{i=1}^{m} F(t, c_i)$, m is the number of classes, and $F(t, c_i)$ is the number of times the term t occurs in class $c_i$. Word stems or roots were also used for feature selection in this work: words with the same stem or root are considered as one feature, and the features with the highest frequency are used.
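As an illustration of the TF-based selection, the following Python sketch sums the per-class counts F(t, c_i) over all classes and keeps the k most frequent terms. The data layout (a dictionary mapping each class name to a list of tokenized documents), the function names, and the cut-off parameter `k` are assumptions made for the example; the text does not state how many features are kept.

```python
from collections import Counter


def class_frequencies(docs_by_class):
    """F(t, c_i): number of occurrences of each term t within each class c_i."""
    return {
        label: Counter(token for doc in docs for token in doc)
        for label, docs in docs_by_class.items()
    }


def term_frequency(docs_by_class):
    """TF(t): sum of F(t, c_i) over all m classes."""
    tf = Counter()
    for freq in class_frequencies(docs_by_class).values():
        tf.update(freq)  # Counter.update adds counts instead of replacing them
    return tf


def select_features(docs_by_class, k):
    """Keep the k terms with the highest total frequency (k is hypothetical)."""
    return [term for term, _ in term_frequency(docs_by_class).most_common(k)]


# Toy corpus: tokens are assumed to be preprocessed (and optionally stemmed).
corpus = {
    "sports": [["كرة", "فريق", "كرة"]],
    "economy": [["سوق", "كرة"], ["سوق"]],
}
print(select_features(corpus, 2))
```

When stems or roots are used as features, the same functions apply unchanged; each token is simply replaced by its stem or root before counting, so that words sharing a stem contribute to one feature.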
Arabic Stemming Algorithms
Stemming techniques can be applied to Arabic text to reduce multiple forms of a word to one form (root or stem). Stemming can be defined as the process of removing any affixes (prefixes, infixes, and/or suffixes) from words to reduce these words to their stems or roots. A root can be defined as a term
Four different stemming methods are used in our work. Three of them (the Khoja stemmer, the light stemmer, and the root extractor) have already been applied by different authors, but we applied these stemmers to a different data set. Finally, a new stemming technique (the WordNet stemmer) is proposed in our work. In this section, a brief review of the four stemming approaches is presented.
Applying Naive Bayes after stop-word removal and normalization, then using the Khoja root stemmer. The Khoja stemmer removes any affixes from words and reduces these words to their roots. For example, stemming the English word "computing" produces the root "compute". This is the same root produced by the word "computation". After reducing words to their roots, these roots can be used in text classification (stemming helps in mapping grammatical variations of a word to instances of the same term). (ref. Khoja)
Applying Naive Bayes after stop-word removal and normalization, then using the light stemmer. The main idea behind light stemming is that many word variants do not have similar meanings or semantics, even though they are generated from the same root. Thus, root extraction algorithms affect the meanings of words. Light stemming aims to enhance the categorization performance while retaining the word meanings. It removes some defined prefixes and suffixes from the word instead of extracting the original root.
Light stemming keeps the words' meanings unaffected. In this phase, we applied the light-stemming approach. Here we note that light stemming maintains the difference between (الكتاب) and (الكاتبون), which mean "the book" and "the writers" respectively; their light stems are (كاتب
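The idea of stripping a fixed set of prefixes and suffixes can be sketched in Python as below. This is a simplified illustration under stated assumptions: the affix lists are common Arabic light-stemming affixes chosen for the example and are not necessarily the lists used in our work, at most one prefix and one suffix are removed, and the minimum remaining length `min_len` is a hypothetical safeguard.

```python
# Illustrative affix lists (assumption; not necessarily those used in the paper).
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, trying longer affixes first.

    An affix is removed only if at least `min_len` letters remain
    (hypothetical safeguard against over-stemming short words).
    """
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["الكتاب", "الكاتبون"]:
    print(w, "->", light_stem(w))
```

With these illustrative lists the two words discussed above are reduced to different stems, whereas a root extractor would map both to the same three-letter root.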
Applying Naive Bayes after stop-word removal and normalization, then using the root extractor stemmer. The root extraction algorithm handles only three-letter roots, and it is efficient to use since more than 80% of Arabic words have three-letter roots, and the words that are frequently used in Arabic writing usually have three-letter roots. Al-Shaalabi 2007 extracts word roots by assigning weights (from 0 to 5) and ranks to the letters in a word. Each letter's weight is then multiplied by its rank, and the letters with the smallest product values represent the root of the word. (ref root...
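The weight-times-rank mechanism can be illustrated with the following Python sketch. It only demonstrates the selection step (multiply weight by rank, keep the three letters with the smallest products, preserve their order in the word). The weight table and the rank function below are placeholders invented for this example and should not be read as the published values of the root extractor.

```python
# Placeholder weights in the 0-5 range (assumption, not the published table):
# letters that frequently act as affixes get larger weights, all others get 0.
HYPOTHETICAL_WEIGHTS = {
    "ا": 5, "ة": 5,
    "ي": 3.5, "و": 3.5, "ى": 3.5,
    "ت": 3, "م": 3, "ن": 3,
    "ل": 2, "س": 1, "ه": 1,
}


def hypothetical_rank(index, length):
    """Placeholder rank (assumption): grows with distance from the word's centre."""
    return 1 + abs(index - (length - 1) / 2)


def extract_root(word):
    """Return the three letters with the smallest weight * rank, in word order."""
    if len(word) <= 3:
        return word
    scored = []
    for i, letter in enumerate(word):
        weight = HYPOTHETICAL_WEIGHTS.get(letter, 0)
        scored.append((weight * hypothetical_rank(i, len(word)), i, letter))
    smallest = sorted(scored)[:3]                        # by product, then position
    in_order = sorted(smallest, key=lambda item: item[1])  # restore word order
    return "".join(letter for _, _, letter in in_order)


for w in ["كاتب", "مكتوب", "الكاتبون"]:
    print(w, "->", extract_root(w))
```

With these placeholder tables, all three example words are reduced to the three-letter root (كتب); with different weights or ranks the selected letters may differ.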