MULCE: Multi-level Canonicalization with Embeddings of Open Knowledge Bases
Published in WSDM, 2020
Recommended citation: Tien-Hsuan Wu, Ben Kao, Zhiyong Wu, Xiyang Feng, Qianli Song, Cheng Chen. MULCE: Multi-level Canonicalization with Embeddings of Open Knowledge Bases. Web Information Systems Engineering – WISE 2020. Lecture Notes in Computer Science, pages 315-327. Springer, 2020. https://dl.acm.org/doi/abs/10.1145/3336191.3371782
An open knowledge base (OKB) is a repository of facts, which are typically represented in the form of (subject; relation; object) triples. The problem of canonicalizing OKB triples is to map different names mentioned in the triples that refer to the same entity into a basic canonical form. We propose the algorithm Multi-Level Canonicalization with Embeddings (MULCE) to perform canonicalization. MULCE executes in two steps. The first step performs word-level canonicalization to coarsely group subject names based on their GloVe vectors into semantically similar clusters. The second step performs sentence-level canonicalization to refine the clusters by employing BERT embedding to model relation and object information. Our experimental results show that MULCE outperforms state-of-the-art methods.