Word-alignment Gold Reference
The benchmark contains 200 English-Italian sentence pairs extracted from the JRC-Acquis Corpus, together with their word-to-word alignments produced by two professional translators. The data were prepared within the EU-funded project MateCat to evaluate the performance of a word alignment toolkit.
A couple of scripts for evaluating the alignment are also included.
The "README" file included in the distributed archive reports details and statistics about the benchmark, as well as usage instructions for the evaluation scripts.
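The sketch below is not one of the distributed scripts, whose actual usage is documented in the README. It only illustrates how word alignments are commonly scored against a gold reference, using precision, recall and alignment error rate (AER). The `i-j` link format, the function names `parse_links` and `score`, and the toy data are assumptions made for illustration and may not match the benchmark's file format.

```python
def parse_links(line):
    """Parse a line such as '0-0 1-2 2-1' into a set of (source, target) index pairs.

    The 'i-j' notation is an assumed format, not necessarily the benchmark's own.
    """
    return {tuple(int(i) for i in pair.split("-")) for pair in line.split()}


def score(hypothesis, sure, possible=None):
    """Return (precision, recall, AER) for one sentence pair.

    hypothesis: links proposed by the aligner
    sure:       gold links annotated as certain
    possible:   optional gold links annotated as acceptable but not required
    """
    # With no separate "possible" set, the sure links are the whole reference.
    possible = sure if possible is None else possible | sure
    hits_possible = len(hypothesis & possible)
    hits_sure = len(hypothesis & sure)
    precision = hits_possible / len(hypothesis) if hypothesis else 0.0
    recall = hits_sure / len(sure) if sure else 0.0
    total = len(hypothesis) + len(sure)
    aer = 1.0 - (hits_possible + hits_sure) / total if total else 0.0
    return precision, recall, aer


# Toy data (hypothetical): four gold links, three proposed links, two of them correct.
gold = parse_links("0-0 1-1 2-2 3-3")
hyp = parse_links("0-0 1-1 2-3")
precision, recall, aer = score(hyp, gold)
print(f"precision={precision:.3f} recall={recall:.3f} AER={aer:.3f}")
```

On the toy data this gives a precision of 2/3, a recall of 1/2 and an AER of 3/7. Corpus-level figures are normally obtained by summing the link counts over all sentence pairs before dividing, rather than by averaging per-sentence scores.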
The creation of this benchmark was supported by the EU-funded project MateCat (ICT-2011.4.2-287688).
The benchmark is distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives (BY-NC-ND) license.
The resource is available here.
Whenever making reference to this resource, please cite the following paper:
Farajian, M. Amin, Nicola Bertoldi and Marcello Federico. “Online Word Alignment for Online Adaptive Machine Translation”. Proceedings of the Workshop on Humans and Computer-assisted Translation, co-located with the 14th Conference of the European Chapter of the Association for Computational Linguistics. Gothenburg, Sweden, 2014, pp. 84-92.
For questions or support regarding this benchmark, please contact: bertoldi [at] fbk [dot] eu