We apply cross-lingual Latent Semantic Indexing to the Bilingual Document Alignment Task at WMT16. Reduced-rank singular value decomposition of a bilingual term-document matrix derived from known English/French page pairs in the training data allows us to map monolingual documents into a joint semantic space. Two variants of cosine similarity between the vectors that place each document in the joint semantic space are combined with a measure of string similarity between the corresponding URLs to produce 1:1 alignments of English/French web pages in a variety of domains. The system achieves a recall of ca. 88% if no in-domain data is used for building the latent semantic model, and 93% if such data is included. Analysing the system's errors on the training data, we argue that evaluating aligner performance based on exact URL matches underestimates an aligner's true performance, and we propose an alternative evaluation that is able to account for duplicates and near-duplicates in the underlying data.
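The core idea (reduced-rank SVD of a bilingual term-document matrix, followed by mapping monolingual documents into the joint space and comparing them by cosine similarity) can be illustrated with a minimal sketch. This is not the system's implementation: the toy page pairs, the raw term counts without any weighting, the rank `k = 2`, and the helper names `bow`, `fold_in` and `cosine` are all assumptions made for illustration. The URL similarity component and the 1:1 assignment step are omitted.

```python
import numpy as np

# Toy "known page pairs" (hypothetical data): each training document is the
# concatenation of an English page and its French counterpart.
pairs = [
    ("cat dog", "chat chien"),
    ("cat bird", "chat oiseau"),
    ("wine cheese", "vin fromage"),
    ("wine bread", "vin pain"),
]
merged = [f"{en} {fr}" for en, fr in pairs]

# Shared vocabulary over both languages; identical strings share one row.
vocab = {}
for doc in merged:
    for tok in doc.split():
        vocab.setdefault(tok, len(vocab))

def bow(text):
    """Raw term-count vector over the bilingual vocabulary (no weighting)."""
    v = np.zeros(len(vocab))
    for tok in text.split():
        if tok in vocab:
            v[vocab[tok]] += 1.0
    return v

# Bilingual term-document matrix: rows are terms, columns are merged documents.
A = np.stack([bow(doc) for doc in merged], axis=1)

# Reduced-rank SVD: keep only the k largest singular values and vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed rank for this toy example
U_k, s_k = U[:, :k], s[:k]

def fold_in(text):
    """Map a monolingual document into the joint space: S_k^{-1} U_k^T d."""
    return (bow(text) @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Unseen monolingual documents: one English query, two French candidates.
en_doc = fold_in("dog bird")
print("on-topic :", cosine(en_doc, fold_in("chien oiseau")))
print("off-topic:", cosine(en_doc, fold_in("fromage pain")))
```

On this toy data the English document and the on-topic French document share no surface terms, yet their similarity in the joint space is close to 1, while the off-topic French document scores close to 0. A full aligner would compute such scores for all candidate pairs, combine them with URL similarity, and then select a 1:1 assignment.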