Abstract:
Identifying semantic similarity between two texts has many applications in
NLP, including information extraction and retrieval, word sense disambiguation,
text summarization, and type classification. Similarity between texts
is commonly determined using a taxonomy-based approach, but the limited
scalability of existing taxonomies has led recent research to use Wikipedia's
encyclopaedic knowledge base to find similarity or relatedness. In this thesis,
we propose Hierarchical Semantic Analysis, a method which represents
the semantics of a text in a high-dimensional space of Wikipedia concepts and
category hierarchies. We represent the meaning of any text excerpt as a
weighted vector of Wikipedia-based resources. To evaluate the similarity of
texts in this space, we compare the corresponding vectors using conventional
metrics (e.g. cosine). Compared with the previous state of the art, use of
Hierarchical Semantic Analysis (HSA) results in substantial improvements in
the correlation of computed similarity scores with human judgements: from
r = 0.873 to 0.901 for short sentence pairs and from r = 0.72 to 0.863 for
paragraph pairs.
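
As an illustration of the vector comparison mentioned above, the following is a minimal sketch of cosine similarity between two sparse weighted concept vectors. It is not the thesis implementation: the function name `cosine_similarity`, the dictionary representation, and the concept names and weights are all assumptions invented for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as {concept: weight} dicts."""
    # Iterate over the smaller vector; concepts missing from the other contribute 0.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(w * v.get(concept, 0.0) for concept, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # An empty or all-zero vector has no direction; treat similarity as 0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical concept weights, made up for illustration only.
text_a = {"Bank": 0.8, "Finance": 0.6}
text_b = {"Bank": 0.6, "River": 0.8}
score = cosine_similarity(text_a, text_b)
```

Only concepts shared by both texts contribute to the dot product, so two texts that activate disjoint sets of concepts score 0 regardless of their individual weights.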