Document clustering is the process of organizing an electronic corpus of documents into subgroups with similar text features. Statistical algorithms have long been applied to cluster data, including text documents. More recently, there have been efforts to improve clustering performance with optimization-based algorithms such as evolutionary algorithms. As a result, document clustering with evolutionary algorithms has become an emerging topic that has gained increasing attention in recent years. This paper presents an up-to-date review devoted entirely to evolutionary algorithms designed for document clustering. It first provides a comprehensive examination of the document clustering model, revealing its components and related concepts. It then presents and analyzes the principal research work on this topic. Finally, it brings together and classifies the objective functions used across the collected research papers. The paper concludes by addressing some important issues and challenges that can be the subject of future work.
The objective function (or fitness function) is the measure that evaluates the optimality of the solutions an evolutionary algorithm generates in the search space. In the clustering domain, the fitness function measures the adequacy of the partitioning. Accordingly, it needs to be formulated carefully, taking into consideration that clustering is an unsupervised process.
Different objective functions generate different solutions even from the same evolutionary algorithm. The fitness can also be either a minimization or a maximization function. Moreover, the algorithm can be formulated with a single objective function or with multiple objective functions. To sum up, "choosing optimization criterion is one of the fundamental dilemmas in clustering".
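The minimization/maximization distinction can be illustrated with a minimal sketch. It assumes that documents are already represented as term-weight vectors and that a candidate solution is a list of cluster labels. The function names (`cohesion`, `total_distance`, `better`) and the data representation are illustrative assumptions, not taken from any of the reviewed works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def centroids(docs, labels):
    """Mean vector of each cluster, keyed by cluster label."""
    grouped = {}
    for vec, lab in zip(docs, labels):
        grouped.setdefault(lab, []).append(vec)
    return {lab: [sum(col) / len(members) for col in zip(*members)]
            for lab, members in grouped.items()}

def cohesion(docs, labels):
    """Maximization objective: mean similarity of documents to their centroid."""
    cents = centroids(docs, labels)
    sims = [cosine_similarity(v, cents[l]) for v, l in zip(docs, labels)]
    return sum(sims) / len(docs)

def total_distance(docs, labels):
    """Minimization objective: summed cosine distance (1 - similarity)."""
    cents = centroids(docs, labels)
    return sum(1.0 - cosine_similarity(v, cents[l])
               for v, l in zip(docs, labels))

def better(fit_a, fit_b, maximize):
    """Compare two fitness values according to the direction of optimality."""
    return fit_a > fit_b if maximize else fit_a < fit_b
```

In this particular pair the two objectives rank candidate partitions identically and differ only in direction, so the selection step of the algorithm must know which direction applies. Objectives that measure different properties of a partition can instead prefer different solutions.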
As a result, the reviewed studies show diversity in how they formulate or choose fitness functions. We gather all of these objective functions in a separate section to make them easy to compare and to develop further.
We noticed that the content and web document clustering algorithms mostly used three groups of functions:
- Similarity / distance measures.
- Inter-cluster measures, intra-cluster measures, or both.
- Internal validity index measures.
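As a rough illustration of the second and third groups, the sketch below computes an intra-cluster measure, an inter-cluster measure, and one familiar internal validity index. The Davies-Bouldin index is used here only as a well-known example; it is an assumption of this sketch, not a statement about which indices the reviewed works adopt. The Euclidean distance and all function names are likewise illustrative choices.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_by_label(docs, labels):
    """Collect the document vectors of each cluster."""
    clusters = {}
    for vec, lab in zip(docs, labels):
        clusters.setdefault(lab, []).append(vec)
    return clusters

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def intra_cluster(docs, labels):
    """Mean distance of documents to their own centroid (to be minimized)."""
    total = 0.0
    for members in group_by_label(docs, labels).values():
        c = centroid(members)
        total += sum(euclidean(v, c) for v in members)
    return total / len(docs)

def inter_cluster(docs, labels):
    """Mean pairwise distance between centroids (to be maximized).

    Assumes at least two clusters.
    """
    cents = [centroid(m) for m in group_by_label(docs, labels).values()]
    pairs = [(i, j) for i in range(len(cents))
             for j in range(i + 1, len(cents))]
    return sum(euclidean(cents[i], cents[j]) for i, j in pairs) / len(pairs)

def davies_bouldin(docs, labels):
    """Davies-Bouldin index (to be minimized).

    Assumes at least two clusters with distinct centroids.
    """
    clusters = list(group_by_label(docs, labels).values())
    cents = [centroid(m) for m in clusters]
    # Scatter of a cluster: mean distance of its members to the centroid.
    scatter = [sum(euclidean(v, c) for v in m) / len(m)
               for m, c in zip(clusters, cents)]
    k = len(clusters)
    worst_ratios = [
        max((scatter[i] + scatter[j]) / euclidean(cents[i], cents[j])
            for j in range(k) if j != i)
        for i in range(k)
    ]
    return sum(worst_ratios) / k
```

An evolutionary algorithm could use any one of these values as its fitness, or combine the intra- and inter-cluster measures in a single or multi-objective formulation. The direction of optimality differs between them and has to be handled accordingly.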
On the other hand, the keyword/keyphrase clustering algorithms used either generated or statistical fitness functions.
All objective functions of the presented EA-based studies are listed in Table 1 below. Details of the parameters and/or equations composing each function are explained beneath it, followed by a description of the type of optimality (minimization/maximization) and the category of the fitness function. The functions are arranged in the same order in which they appeared in the previous sections.
Document clustering has been the research subject of an increasing number of studies. After each stage of this research, there were attempts to combine and classify these studies in review or survey papers. A number of these...