The idea of text clustering long preceded the computer age: “Clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day” (Theodoridis and Koutroubas, 2003: 398). The indexing long used in libraries is an obvious example. Before the computer age, manual clustering was the only type of document clustering possible. This circumstance may have influenced much clustering work, which relied only on immediate intuitive knowledge of the world and made no use of quantitative methods. In other words, text clustering was usually performed in subjective ways that relied heavily on the perception, knowledge, and judgment of the researcher. With easier access to electronic data across disciplines and growing computational power on the one hand, and the need to maintain standards of objectivity on the other, it has become ever more likely that such procedures must involve automated computational methods (Arabie et al., 1996), in which human intuition and traditional organization methods are replaced by mathematical and computational techniques (Golub, 2006; Golub, 2005). Accordingly, recent years have witnessed a flourishing of automated statistical clustering and classification systems designed to address the subjectivity inherent in traditional text classification applications. It is this need for an automated, objective methodology that motivates our clustering of Hardy’s novels and short stories.
Clustering vs. classification
The two terms clustering and classification are used extensively throughout this thesis. The question that arises at this point is: are they synonymous, or is there a distinction?
To answer this question, some overlapping concepts should be considered. Firstly, the terms text classification and text categorization overlap. In the information retrieval (IR) and text classification literature (Sebastiani, 2006; Svetlana, 2006; Taeho, 2006; Mirkin, 2005; Sebastiani, 2005a; Sebastiani, 2005b), the two terms are often used interchangeably, and this thesis does the same. Secondly, the terms text clustering and text classification are frequently confused. While many studies (Janos and Balazs, 2007; Wang, 2007; Ozgur, 2006; Jain et al., 1999) use the two terms interchangeably, this thesis does not. What the two share is a concern with grouping documents into clusters or groups. However, mechanisms for doing so are...