BitterCorpus is a collection of parallel English-Italian documents in the Information Technology (IT) domain, in which domain-specific terms have been manually marked and aligned. The documents are extracted from the GNOME and KDE data collections. Together, the two parts contain 874 domain-specific bilingual terms.
GNOME Corpus: It contains 55 parallel documents extracted from the GNOME manual documentation (IT domain). Three annotators, fluent in English and Italian, were selected to annotate the documents with domain-specific terms. In total, they annotated 313 Italian and 282 English terms, and 237 bilingual domain-specific terms.
KDE Corpus: It contains one parallel document of 100 lines of text, extracted from the KDE manual documentation (IT domain). Three annotators, fluent in English and Italian, were selected to annotate the document with domain-specific terms. In total, they annotated 628 Italian and 628 English terms, and 637 bilingual domain-specific terms.
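As a quick consistency check on the figures above, the per-subcorpus counts can be tallied in a few lines of Python. This is only a sketch: the numbers are copied from the descriptions of the GNOME and KDE parts, while the dictionary layout, the key names and the `total` helper are illustrative assumptions and do not reflect the format of the distributed files.

```python
# Figures copied from the GNOME and KDE descriptions above.
# The structure and key names are hypothetical, chosen for illustration;
# they are not part of the BitterCorpus distribution.
CORPUS_STATS = {
    "GNOME": {
        "documents": 55,
        "italian_terms": 313,
        "english_terms": 282,
        "bilingual_terms": 237,
    },
    "KDE": {
        "documents": 1,
        "italian_terms": 628,
        "english_terms": 628,
        "bilingual_terms": 637,
    },
}


def total(field):
    """Sum one reported figure across both subcorpora."""
    return sum(stats[field] for stats in CORPUS_STATS.values())


if __name__ == "__main__":
    # 237 (GNOME) + 637 (KDE) gives the overall bilingual term count.
    print(total("bilingual_terms"))  # prints 874
```

The bilingual total matches the overall figure of 874 terms stated for the whole collection.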
The creation of BitterCorpus was supported by the EU-funded project MateCat (ICT-2011.4.2-287688).
BitterCorpus is freely available for research purposes, and is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike (BY-NC-SA) license.
Click the button to get BitterCorpus (a request form must be filled in).
Whenever making reference to BitterCorpus, please cite the following paper:
Mihael Arcan, Marco Turchi, Sara Tonelli and Paul Buitelaar. "Enhancing Statistical Machine Translation with Bilingual Terminology in a CAT Environment". In Proceedings of the Eleventh Biennial Conference of the Association for Machine Translation in the Americas (AMTA), Vancouver, Canada. 2014, pp. 54-68.
For questions and support about BitterCorpus please contact: turchi [at] fbk [dot] eu