Täckström, Oscar (2005) An Evaluation of Bag-of-Concepts Representations in Automatic Text Classification. Masters thesis, KTH.
Full text not available from this repository.
Official URL: http://www.nada.kth.se/utbildning/grukth/exjobb/ra...
Automatic text classification is the process of automatically classifying text documents into pre-defined document classes. Traditionally, documents are represented in the so-called bag-of-words model. In this model, documents are simply represented as vectors in which dimensions correspond to words. In this project, a representation called bag-of-concepts has been evaluated. This representation is based on models for representing the meanings of words in a vector space. Documents are then represented as linear combinations of the words' meaning vectors. The resulting vectors are high-dimensional and very dense. We have investigated two different methods for reducing the dimensionality of the document vectors: feature selection based on gain ratio, and random mapping. Two domains of text have been used: abstracts of medical articles in English and texts from Internet newsgroups. The former has been of primary interest, while the latter has been used for comparison. The classification has been performed using three different machine learning methods: Support Vector Machine, AdaBoost and Decision Stump. The results of the evaluation are difficult to interpret, but suggest that the new representation gives significantly better results on document classes for which the classical method fails. The representations seem to give equal results on document classes for which the classical method works fine. Both dimensionality reduction methods are robust. Random mapping, while being much less computationally expensive, shows greater variance.
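To make the two representations concrete, the following is a minimal sketch and not the implementation used in the thesis. It assumes that each word already has a small, hand-made "meaning" vector (the `concept_vectors` values are invented for illustration), builds a document vector as a count-weighted sum of those vectors, and then reduces it with a Gaussian random matrix as one common way to realise random mapping. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count vector with one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def bag_of_concepts(tokens, concept_vectors, dim):
    """Linear combination of word meaning vectors, weighted by word count."""
    doc = [0.0] * dim
    for word, count in Counter(tokens).items():
        vec = concept_vectors.get(word)
        if vec is None:  # words without a meaning vector are skipped
            continue
        for i, value in enumerate(vec):
            doc[i] += count * value
    return doc


def random_mapping_matrix(in_dim, out_dim, seed=0):
    """Random projection matrix with scaled Gaussian entries (an assumption)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def random_mapping(vector, matrix):
    """Project a vector to a lower dimension by matrix-vector multiplication."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Toy data: invented 4-dimensional meaning vectors for four words.
vocabulary = ["heart", "cardiac", "tumor", "cancer"]
concept_vectors = {
    "heart": [0.9, 0.1, 0.0, 0.2],
    "cardiac": [0.8, 0.2, 0.1, 0.1],
    "tumor": [0.1, 0.9, 0.3, 0.0],
    "cancer": [0.0, 0.8, 0.4, 0.1],
}
document = ["heart", "cardiac", "heart"]

bow = bag_of_words(document, vocabulary)
boc = bag_of_concepts(document, concept_vectors, dim=4)
reduced = random_mapping(boc, random_mapping_matrix(in_dim=4, out_dim=2))
```

In this toy case the bag-of-words vector is sparse (most entries are zero), whereas the bag-of-concepts vector is dense, which is why a dimensionality reduction step becomes relevant once the meaning vectors have thousands of dimensions.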
|Item Type:||Thesis (Masters)|
|Additional Information:||Report number: TRITA-NA-E05150, 2005.|
|Deposited By:||Oscar Täckström|
|Deposited On:||09 Jul 2008|
|Last Modified:||18 Nov 2009 16:18|