Topically-informed bilingually-constrained recursive autoencoders for statistical machine translation

Ruan, Zhiwei; Ji, Rongrong

doi:10.4310/CIS.2018.v18.n1.a3

Communications in Information and Systems

Volume 18 (2018)

Number 1

Pages: 53 – 72

Authors

Zhiwei Ruan (Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Fujian, China)

Rongrong Ji (Fujian Key Laboratory of Sensing and Computing for Smart City, School of Information Science and Engineering, Xiamen University, Fujian, China)

Abstract

Learning high-quality phrase vector representations is an important research topic in statistical machine translation (SMT). To learn phrase embeddings, most existing work explores syntactic and semantic clues among the words inside a phrase, which are insufficient for representation learning because they lack context information. In this paper, we propose topically-informed bilingually-constrained recursive autoencoders for SMT, which substantially extend conventional bilingually-constrained recursive autoencoders by exploiting latent topics in two ways. First, we introduce topical contexts to induce topical phrase embeddings. Second, word topic assignments from a latent topic model are leveraged to constrain the learning of word and topic embeddings, both of which form the basis of contextual phrase embedding learning in the proposed model. Experimental results on Chinese-English translation show that the proposed model significantly improves translation quality on NIST test sets.
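
As a rough illustration of the kind of model the abstract describes, the sketch below builds a phrase vector with a plain recursive autoencoder whose inputs carry topic information. It is not the authors' implementation. The element-wise sum of word and topic embeddings, the greedy merge order, the tanh decoder, the dimension DIM, and all function and parameter names are assumptions made for illustration, and the bilingual constraint and all training steps are left out.

```python
import math
import random

DIM = 4  # embedding size, chosen arbitrarily for this sketch


def init_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_params(rng):
    """Hypothetical parameter set: one encoder and one decoder layer."""
    return {
        "W_enc": init_matrix(DIM, 2 * DIM, rng), "b_enc": [0.0] * DIM,
        "W_dec": init_matrix(2 * DIM, DIM, rng), "b_dec": [0.0] * (2 * DIM),
    }


def affine_tanh(W, b, x):
    """Compute tanh(W x + b) for a list-of-lists matrix W."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def compose(left, right, params):
    """Encode two child vectors into a parent and score its reconstruction."""
    children = left + right  # list concatenation, i.e. [left; right]
    parent = affine_tanh(params["W_enc"], params["b_enc"], children)
    rebuilt = affine_tanh(params["W_dec"], params["b_dec"], parent)
    error = 0.5 * sum((c - r) ** 2 for c, r in zip(children, rebuilt))
    return parent, error


def topical_inputs(words, topics, word_emb, topic_emb):
    """Assumption: combine each word with its assigned topic by summing."""
    return [[w + t for w, t in zip(word_emb[word], topic_emb[topic])]
            for word, topic in zip(words, topics)]


def encode_phrase(vectors, params):
    """Greedily merge the adjacent pair with the lowest reconstruction error."""
    nodes = [list(v) for v in vectors]
    total_error = 0.0
    while len(nodes) > 1:
        candidates = [compose(nodes[i], nodes[i + 1], params)
                      for i in range(len(nodes) - 1)]
        best = min(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, error = candidates[best]
        nodes[best:best + 2] = [parent]  # replace the pair with its parent
        total_error += error
    return nodes[0], total_error


rng = random.Random(0)
params = make_params(rng)
# Toy vocabulary and topic inventory, invented for this example.
word_emb = {w: [rng.uniform(-1, 1) for _ in range(DIM)]
            for w in ("machine", "translation", "model")}
topic_emb = {k: [rng.uniform(-1, 1) for _ in range(DIM)] for k in (0, 1)}

inputs = topical_inputs(["machine", "translation", "model"], [0, 0, 1],
                        word_emb, topic_emb)
phrase_vec, rec_error = encode_phrase(inputs, params)
```

Here `phrase_vec` is a DIM-dimensional vector for the three-word phrase and `rec_error` is the summed reconstruction error over the two merges. In a trained model that error would be one term of the objective, alongside whatever bilingual and topic-related terms the full method defines.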

This work is supported by the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Natural Science Foundation of China (No. U1705262, No. 61772443, and No. 61572410), the Post Doctoral Innovative Talent Support Program under Grant BX201600094, the China Post-Doctoral Science Foundation under Grant 2017M612134, the Scientific Research Project of National Language Committee of China (Grant No. YB135-49), and the Natural Science Foundation of Fujian Province, China (No. 2017J01125 and No. 2018J01106).

Published 7 June 2018