OPUS: A Collection of Multilingual Parallel Corpora with Tools and Interfaces
The OPUS corpus is a growing collection of translated documents gathered from the internet. The current version contains about 30 million words in 60 languages. The entire corpus is sentence-aligned, and it also contains linguistic markup for certain languages. The OPUS project aims to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. OPUS is based on open source products, and the corpus is also delivered as an open content package. Several tools were used to compile the current collection. All pre-processing is done automatically, and no manual corrections have been carried out. OPUS offers the following corpora as sentence-aligned .txt files, gzipped XCES files, and translation memory import files:
Downloads & Samples:
- EMEA – European Medicines Agency documents
- EUconst – The European constitution
- EUROPARL – European Parliament Proceedings
- OO – the OpenOffice.org corpus
- OpenSubs – the opensubtitles.org corpus
- KDE4 – KDE4 localization files (v.2)
- KDEdoc – the KDE manual corpus
- PHP – the PHP manual corpus
- SETIMES – A parallel corpus of the Balkan languages
- SPC – Stockholm Parallel Corpora
For example, the EMEA page reads as follows. EMEA (source: http://www.emea.europa.eu): This is a parallel corpus made out of PDF documents from the European Medicines Agency. All files are automatically converted from PDF to plain text using pdftotext with the command-line arguments -layout -nopgbrk -eol unix.
- 22 languages, 231 bitexts
- total number of files: 48087
- total number of tokens: 327799817
- total number of sentence fragments: 28110987
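The conversion step described for EMEA can be sketched as a small Python wrapper around pdftotext. This is only an illustration, not the OPUS project's actual script: the helper names `build_pdftotext_command` and `convert_pdf` and the file name in the usage note are hypothetical, and only the three flags come from the description above. It assumes pdftotext is installed and available on the PATH.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Return the pdftotext argument list with the flags quoted for EMEA."""
    return [
        "pdftotext",
        "-layout",       # keep the physical page layout as far as possible
        "-nopgbrk",      # do not insert page-break characters between pages
        "-eol", "unix",  # write Unix line endings
        str(pdf_path),
        str(txt_path),
    ]


def convert_pdf(pdf_path, txt_path=None):
    """Convert one PDF to plain text (hypothetical helper).

    Requires the pdftotext executable; raises CalledProcessError on failure.
    """
    pdf_path = Path(pdf_path)
    if txt_path is None:
        # Default: write the text file next to the PDF
        txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    return Path(txt_path)
```

Calling `convert_pdf("leaflet.pdf")` would, under these assumptions, write `leaflet.txt` next to the input file.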
The EMEA page also offers downloads arranged in a language-by-language matrix. Upper-right triangle: m = plain text files (MOSES/GIZA++), t = translation memories (TMX), language IDs = browse language files. Bottom-left triangle: ces = sentence alignments in XCES format, language IDs = zipped corpus file archive.
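As a rough illustration of how the plain text (MOSES/GIZA++) downloads are typically consumed: such bitexts usually come as two files with one sentence per line, where line i of one file is the translation of line i of the other. The sketch below pairs the lines up in Python. The function name and the file names in the usage note are hypothetical, and the two-file layout is an assumption about the download rather than something stated on the page.

```python
from itertools import zip_longest


def read_bitext(src_path, tgt_path, encoding="utf-8"):
    """Yield (source, target) sentence pairs from two line-aligned files."""
    with open(src_path, encoding=encoding) as src, \
         open(tgt_path, encoding=encoding) as tgt:
        for s, t in zip_longest(src, tgt):
            # zip_longest pads the shorter file with None, which signals
            # that the two files are not aligned line by line
            if s is None or t is None:
                raise ValueError("files have different numbers of lines")
            yield s.rstrip("\n"), t.rstrip("\n")
```

With a downloaded pair such as `emea.en` and `emea.sv` (hypothetical names), `for en, sv in read_bitext("emea.en", "emea.sv"): ...` would iterate over the aligned sentence pairs.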

