Text Clustering Essay


The idea of text clustering long preceded the computer age: “Clustering is one of the most primitive mental activities of humans, used to handle the huge amount of information they receive every day” (Theodoridis and Koutroumbas, 2003: 398). The indexing long practiced in libraries is an obvious example. Before the computer age, manual clustering was the only kind of document clustering possible. This circumstance may have influenced much clustering work, which relied on immediate, intuitive knowledge of the world rather than on quantitative methods. In other words, text clustering was usually performed subjectively, relying heavily on the perception, knowledge, and judgment of the researcher. With easier access to electronic data across disciplines and growing computing power on the one hand, and the need to maintain standards of objectivity on the other, such procedures increasingly involve automated computational methods (Arabie et al., 1996), in which human intuition and traditional methods of organization are replaced by mathematical and computational techniques (Golub, 2006; Golub, 2005). Accordingly, recent years have seen a flourishing of automated statistical clustering and classification systems designed to address the subjectivity inherent in traditional text classification. It is this need for an automated, objective methodology that motivates our clustering of Hardy’s novels and short stories.
Clustering vs. classification
The terms clustering and classification are used extensively throughout this thesis. The question that arises at this point is: are they synonymous, or is there a distinction?
To answer this question, some overlapping concepts should be considered. Firstly, there is an overlap between the terms text classification and text categorization. In the information retrieval (IR) and text classification literature (Sebastiani, 2006; Svetlana, 2006; Taeho, 2006; Mirkin, 2005; Sebastiani, 2005a; Sebastiani, 2005b), the two terms are often used interchangeably, and this thesis follows that practice. Secondly, the terms text clustering and text classification are frequently confused. While many studies (Janos and Balazs, 2007; Wang, 2007; Ozgur, 2006; Jain et al., 1999) use the two terms interchangeably, this thesis does not. What the two share is a concern with grouping documents into clusters or groups. However, the mechanisms for doing so are...

Computer-Assisted Text Analysis

Computational approaches are widely used in a variety of text applications, such as feature selection and classification tasks, because they deal efficiently with huge amounts of data. The discussion here, however, is concerned only with applications of computational approaches to literary texts in general and Hardy’s texts in particular. To my knowledge, there is no computer-aided thematic classification of the works of Thomas Hardy

Analyzing The Writings Of Thomas Hardy

The overall aim of this research study was to establish an objective clustering of Thomas Hardy’s prose fiction texts as a basis for better understanding the associations between the texts and for developing an objective thematic analysis of Hardy’s corpus that can address the problems of replicability and objectivity in non-computational thematic classification in literary studies. To achieve this, this thesis used vector space clustering