Corpus



Linguistic corpus
One of the aims of the software is to allow electronic resources to be gathered for small languages for further linguistic investigations.
For this reason, if you allow it, Minority Translate stores on its servers some basic information about your edits that may not be preserved, or may be difficult to access, in a local Wikipedia, and makes it publicly available to all researchers.
The information included
This information includes the time and id of the edit, the source and target languages used, the skill level the author chose for each of those languages, and whether the text was marked as an exact translation or as a loose adaptation of the sources. It also includes basic usage data, such as the filters that alter how much of the text was available and the operating system used.
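As a rough illustration, one stored record could be modelled as in the sketch below. The page lists the kinds of information stored but not the actual schema, so every field name and sample value here is hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EditRecord:
    # All field names are assumptions made for illustration only.
    edit_id: int             # id of the edit
    timestamp: str           # time of the edit, as ISO 8601 text
    source_languages: list   # language codes of the source articles
    target_language: str     # language code of the translation
    skill_levels: dict       # language code -> Babel level ("0".."5" or "N")
    exact_translation: bool  # True if exact, False if loosely adapted
    filters: list            # filters that limited how much text was shown
    operating_system: str    # basic usage data


# Made-up sample values, not taken from the real corpus.
record = EditRecord(
    edit_id=1,
    timestamp="2015-01-01T12:00:00Z",
    source_languages=["en", "et"],
    target_language="vro",
    skill_levels={"en": "3", "et": "N", "vro": "2"},
    exact_translation=False,
    filters=[],
    operating_system="Linux",
)

# Serialise to JSON text, the format in which the raw data is published.
record_json = json.dumps(asdict(record), sort_keys=True)
```

Keeping the skill levels as strings lets the native marker "N" sit alongside the numeric levels without a special case.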
Language skills
The users of Minority Translate and of collaborative editing platforms like Wikipedia may have very different skill levels in the languages involved. Some of us may be enthusiasts of a language that we don't know very well. Others may be fluent native speakers of a language but have very little experience of writing it. Minority Translate has opted to use the Wikipedia language skill classification as specified in the Babel extension. This rates the user's skill in a language on a scale from 0 to 5 and gives native speakers a special status. The Babel extension also lets you display your language skills on your Wikipedia user page if you want.
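Babel writes a skill as a language code followed by a level, for example et-N or en-3. The sketch below shows one way such codes could be split and compared. The function names are hypothetical, the code assumes both parts are present, and ranking native speakers above level 5 is an assumption made only so that levels can be sorted.

```python
BABEL_LEVELS = {"0", "1", "2", "3", "4", "5", "N"}


def parse_babel_code(code):
    """Split a Babel-style code such as 'et-N' into (language, level)."""
    # rpartition splits on the last hyphen, so 'zh-hans-2' also works.
    language, sep, level = code.strip().rpartition("-")
    level = level.upper()
    if not sep or not language or level not in BABEL_LEVELS:
        raise ValueError(f"not a language-level code: {code!r}")
    return language.lower(), level


def skill_rank(level):
    """Map a level to an integer for sorting (assumption: 'N' ranks above 5)."""
    return 6 if level == "N" else int(level)
```

For example, parse_babel_code("en-3") gives the pair ("en", "3"), and sorting levels with skill_rank as the key places "N" last.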
Usage
The corpus files will also include a simple script that allows the materials stored on the server and the materials stored on Wikipedia to be compiled into a basic linguistic corpus, which can be used either as a monolingual text corpus or as a set of parallel corpora.
The raw data collected is currently available as a JSON file. Tools to process it will be included in the future.
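Until those tools exist, a first processing step might look like the sketch below, which indexes edits by language pair as a starting point for parallel corpora. It is not the project's script: the layout of the JSON file is not documented here, so the record structure and the keys edit_id, source_languages and target_language are assumptions.

```python
import json
from collections import defaultdict


def group_by_language_pair(raw_json):
    """Index edit ids by (source language, target language)."""
    pairs = defaultdict(list)
    # Assumed layout: a JSON list of records with the keys used below.
    for rec in json.loads(raw_json):
        for source in rec["source_languages"]:
            pairs[(source, rec["target_language"])].append(rec["edit_id"])
    return dict(pairs)


# Made-up sample data standing in for the published file.
sample = json.dumps([
    {"edit_id": 1, "source_languages": ["en", "et"], "target_language": "vro"},
    {"edit_id": 2, "source_languages": ["en"], "target_language": "vro"},
])
pairs = group_by_language_pair(sample)
```

Each key of the result identifies one parallel corpus; the edit ids under it could then be used to fetch the matching texts from Wikipedia.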
Application
Compiling data on linguistic usage has many applications, most importantly in the fields of language technology and language documentation.
Parallel corpora of the languages being translated may in time allow software to be created that can automatically translate between the languages in question or similar languages. [1] This would allow information to reach a much larger circle of people.
For many small languages there is not nearly enough data available, nor enough researchers working on them, to keep up with the diversity within a language or to understand the diversity between languages. Recording texts in small languages, be they stored in Wikipedia or elsewhere, will allow researchers to better understand the distinctive character of each language and its place among all human languages. Contributing to this corpus of texts can greatly help us learn about ourselves.
  1. For an article on using Wikipedia to improve statistical machine translation, see: Dan Tufis, Radu Ion, Stefan Daniel Dumitrescu (2013). Wiki-Translator: Multilingual Experiments for In-Domain Translations.