Domain Taxonomy Learning from Text: The Subsumption Method versus Hierarchical Clustering

J de Knijff, Flavius Frasincar, Frederik Hogenboom

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

This paper proposes a framework to automatically construct taxonomies from a corpus of text documents. This framework first extracts terms from documents using a part-of-speech parser. These terms are then filtered using domain pertinence, domain consensus, lexical cohesion, and structural relevance. The remaining terms represent concepts in the taxonomy. These concepts are arranged in a hierarchy with either the extended subsumption method that accounts for concept ancestors in determining the parent of a concept or a hierarchical clustering algorithm that uses various text-based window and document scopes for concept co-occurrences. Our evaluation in the field of management and economics indicates that a trade-off between taxonomy quality and depth must be made when choosing one of these methods. The subsumption method is preferable for shallow taxonomies, whereas the hierarchical clustering algorithm is recommended for deep taxonomies.
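To make the hierarchy-building step concrete, the following is a minimal sketch of the basic co-occurrence subsumption test on which subsumption approaches are commonly based: a concept x is taken as a parent of a concept y when x appears in most documents that contain y, but not the other way round. It is not the extended method evaluated in the paper, which additionally accounts for concept ancestors. The function name subsumption_pairs, the 0.8 threshold, and the toy documents are all assumptions made for illustration.

```python
from itertools import permutations


def subsumption_pairs(docs, threshold=0.8):
    """Return (parent, child) pairs under a basic co-occurrence subsumption test.

    docs is a list of collections of concepts, one collection per document.
    threshold is an assumed cut-off for P(parent | child).
    """
    # Map each concept to the set of document indices in which it occurs.
    occurs = {}
    for index, concepts in enumerate(docs):
        for concept in set(concepts):
            occurs.setdefault(concept, set()).add(index)

    pairs = []
    for x, y in permutations(sorted(occurs), 2):
        both = len(occurs[x] & occurs[y])
        p_x_given_y = both / len(occurs[y])  # how often x appears where y does
        p_y_given_x = both / len(occurs[x])  # how often y appears where x does
        # x subsumes y: x covers (almost) all of y, but y does not cover all of x.
        if p_x_given_y >= threshold and p_y_given_x < 1.0:
            pairs.append((x, y))
    return pairs


# Hypothetical toy corpus: each document is reduced to its set of concepts.
docs = [
    {"economics", "inflation"},
    {"economics", "inflation"},
    {"economics", "trade"},
    {"economics"},
]
pairs = subsumption_pairs(docs)
print(pairs)  # [('economics', 'inflation'), ('economics', 'trade')]
```

In this toy corpus "economics" occurs in every document that mentions "inflation" or "trade", while neither narrower concept covers all occurrences of "economics", so both are placed beneath it.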
Original language: English
Pages (from-to): 54-69
Number of pages: 16
Journal: Data & Knowledge Engineering
Volume: 83
Publication status: Published - 2013
