Several studies and procedures for classifying Arabic-language texts have been published, but most relied on different environments and lacked a unified standard and a unified dataset, which makes it difficult to determine the most accurate classification technique. Arabic-language processing is also not as mature as that of other languages. Finding the roots and stems of Arabic words is an important phase toward the most effective Arabic NLP applications, so we are interested in applying algorithms to these phases. The Arabic language has a complex structure, which makes it difficult to carry out NLP research on it.
This thesis presents a study and analysis of classification algorithms in a unified environment with a single dataset, including the challenges these algorithms face, in order to demonstrate their effectiveness and accuracy on a large dataset, given the continuous expansion of data on the internet.
Several algorithms are used to classify texts into groups, which helps retrieve documents more quickly and yields more accurate searches; for Arabic texts these include K-NN, decision trees, Naive Bayes, Random Forest, and others.
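As one concrete illustration of the classifiers listed above, the following is a minimal pure-Python sketch of multinomial Naive Bayes with Laplace smoothing; the function names and the toy documents are our own, not part of the thesis dataset.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train a multinomial Naive Bayes model.
    docs: list of (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # per-class token counts
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    priors = {c: math.log(n / total) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify_nb(model, tokens):
    """Return the most probable class under the model,
    using add-one (Laplace) smoothing for unseen tokens."""
    priors, word_counts, vocab = model
    v = len(vocab)
    best, best_score = None, float("-inf")
    for c in priors:
        denom = sum(word_counts[c].values()) + v
        score = priors[c] + sum(
            math.log((word_counts[c][t] + 1) / denom) for t in tokens
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage: two of the thesis categories with invented example tokens
docs = [
    (["goal", "match", "team"], "Sport"),
    (["team", "goal"], "Sport"),
    (["market", "bank", "money"], "Economy"),
    (["bank", "loan"], "Economy"),
]
model = train_nb(docs)
print(classify_nb(model, ["goal", "team"]))   # classifies as "Sport"
```

In practice a library implementation would be used, but the sketch shows why Naive Bayes scales well to large collections: training is a single counting pass over the documents.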
We used the Diab dataset. It has nine categories, each containing 300 documents; each category has its own directory that includes all files belonging to that particular category. From it we built two further collections with the same directory structure: the second has nine categories of 600 documents each, and the third has nine categories of 1200 documents each.
This gives three collections: collection one, consisting of 2700 files divided into nine categories of 300 files each; collection two, consisting of 5400 files divided into nine categories of 600 files each; and collection three, consisting of 10800 files divided into nine categories of 1200 files each. The categories are Art, Economy, Health, Law, Literature, Politics, Religion, and Sport.
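The directory-per-category layout described above can be read into memory with a short loader like the following sketch; the function name and return shape are our own choices, not taken from the thesis code.

```python
from pathlib import Path

def load_corpus(root):
    """Load a directory-per-category corpus: each subdirectory of
    `root` is one category, and every file inside it is one document.
    Returns {category_name: [document_text, ...]}."""
    corpus = {}
    for cat_dir in sorted(Path(root).iterdir()):
        if cat_dir.is_dir():
            corpus[cat_dir.name] = [
                f.read_text(encoding="utf-8")
                for f in sorted(cat_dir.iterdir()) if f.is_file()
            ]
    return corpus
```

For collection one, for example, `load_corpus` would return nine keys with 300 documents under each; the category label needed for supervised training comes for free from the directory name.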
We applied four preprocessing methods to the original data, so each collection has five parts: part one, the original data; part two, with stop words, punctuation, and diacritics removed; part three, with the Light10 stemmer applied; part four, with the Chen stemmer applied; and part five, with the Khoja algorithm applied to extract the roots.
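The part-two step (removing stop words, punctuation, and diacritics) can be sketched as follows; the tiny stop-word set here is purely illustrative (a real Arabic stop-word list is far larger), and the Light10, Chen, and Khoja stemmers used for parts three to five are external tools not reimplemented in this sketch.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus the tatweel
# elongation character (U+0640)
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Illustrative sample only -- a real stop-word list is much larger
STOP_WORDS = {"في", "من", "على", "إلى", "عن"}

def preprocess(text):
    """Part-two preprocessing: strip diacritics, drop punctuation by
    keeping only runs of Arabic letters, then remove stop words."""
    text = DIACRITICS.sub("", text)
    tokens = re.findall(r"[\u0621-\u064A]+", text)  # Arabic letters only
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("ذَهَبَ الوَلَدُ إلى المَدرَسَةِ"))
# the diacritics and the stop word "إلى" are removed
```

Tokenizing by Arabic letter runs discards Latin text and digits as well; depending on the corpus, one might instead strip an explicit punctuation set and keep numerals.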
then we used seven...