Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Nilupulee Nathawitharana, Damminda Alahakoon, Sumith Matharage

Abstract


Humans are used to expressing themselves with written language and language provides a medium with which we can describe our experiences in detail incorporating individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize and search when vast amounts of documents are collected especially over time. Document clustering is a technique that has been widely used to group documents based on similarity of content represented by the words used. Once key groups are identified further drill down into sub-groupings is facilitated by the use of hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms which can be repeated across documents associated with different topics. Therefore text data cannot be considered as a direct ‘coding’ of a particular experience or situation in contrast to numerical and categorical data and term overlap is a very common characteristic in text clustering. In this paper we propose a new technique and methodology for term overlap capture from text documents, highlighting the different situations such overlap could signify and discuss why such understanding is important for obtaining value from text clustering. Experiments were conducted using a widely used text document collection where the proposed methodology allowed exploring the term diversity for a given document collection and obtain clusters with minimum term overlap.

Keywords


Term overlap; Growing Self Organizing Map; Hierarchical clustering; Text document clustering

Full Text:

PDF


DOI: http://dx.doi.org/10.3127/ajis.v19i0.1180

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Creative Commons License
ISSN: Online: 1326-2238 Hard copy: 1449-8618
This work is licensed under a Creative Commons Attribution-NonCommercial Licence. Uses the Open Journal Systems. Web design by TomW.