Jack Rueter

Summary

ERME e-resource for Moksha-Erzya etc.: Corpora for language description and resource development

Introduction

Corpus materials identified by an .erme tag will contain open-source sentences with identifying headers and other metadata similar to the .conllu format used in the Universal Dependencies project. Unlike UD practice, not all sentences will have .conllu annotation in the lines between annotation headers. In addition to .conllu annotation, there will be disambiguated and non-disambiguated .cg formatting, or simply the non-annotated text.
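As a rough illustration of what reading such material might involve, the sketch below splits a text into blank-line-separated sentence blocks, collects CoNLL-U-style `# key = value` header lines, and checks whether the lines after the headers look like CoNLL-U token rows (ten tab-separated fields). The exact layout of .erme files is not specified here, so the function name `read_blocks`, the block labels, and the sample data are all hypothetical.

```python
import re

def read_blocks(text):
    """Split text into sentence blocks and label each block's annotation.

    Assumption: blocks are separated by blank lines and headers follow the
    CoNLL-U comment convention "# key = value". This is a guess at the
    layout, not a description of the actual .erme files.
    """
    blocks = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        meta, body = {}, []
        for line in chunk.splitlines():
            if line.startswith("#"):
                key, sep, value = line[1:].partition("=")
                if sep:  # ignore comment lines without "="
                    meta[key.strip()] = value.strip()
            elif line.strip():
                body.append(line)
        if not body:
            kind = "header-only"   # no annotation lines after the headers
        elif all(len(row.split("\t")) == 10 for row in body):
            kind = "conllu"        # every row has the ten CoNLL-U fields
        else:
            kind = "other"         # e.g. .cg output or plain text
        blocks.append({"meta": meta, "kind": kind, "body": body})
    return blocks

# Hypothetical sample: one annotated block and one with headers only.
token = "\t".join(["1", "word", "lemma", "NOUN", "_", "_", "0", "root", "_", "_"])
sample = (
    "# sent_id = demo-1\n# text = word\n" + token + "\n\n"
    "# sent_id = demo-2\n# text = another\n"
)
blocks = read_blocks(sample)
```

Labelling a block as "other" rather than trying to recognize .cg output keeps the sketch honest about what it does not know; a real reader would need the actual file specification.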

These morphosyntactic materials have been gathered to better document Erzya, Moksha, Komi-Zyrian, Skolt Sami and a variety of other languages.

We start out with Erzya.

Added 2017-10-26: one chapter from a short story by Mikhail Ivanovich Bryzhinski and one by Andrei Dmitrievich Kutorkin.

These samples will be further enhanced with glosses and person tracking, i.e. possessor indices as well as subject and object reference in conjugation. The code 1-2:3.4,5 reads: part 1, chapter 2, paragraph 3, sentence 4, token 5.
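The reference code can be unpacked mechanically. The following is a minimal sketch, assuming every code has all five numeric fields in exactly the form shown (1-2:3.4,5); the names `Ref` and `parse_ref` are hypothetical and not part of the corpus tooling.

```python
import re
from typing import NamedTuple

class Ref(NamedTuple):
    part: int
    chapter: int
    paragraph: int
    sentence: int
    token: int

# part-chapter:paragraph.sentence,token
_REF = re.compile(r"(\d+)-(\d+):(\d+)\.(\d+),(\d+)")

def parse_ref(code: str) -> Ref:
    """Parse a reference code such as "1-2:3.4,5" into its five fields."""
    match = _REF.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a reference code: {code!r}")
    return Ref(*(int(group) for group in match.groups()))

ref = parse_ref("1-2:3.4,5")
# ref.part == 1, ref.chapter == 2, ref.paragraph == 3,
# ref.sentence == 4, ref.token == 5
```

If shorter codes (for example, sentence-level references without a token index) also occur in the corpus, the pattern would need optional groups; that case is not covered here.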

Sources

Erzya and Moksha pieces in the lemmatized and morphosyntactically parsed ERME-s-v2 corpus.

Resources

Presently there are two versions of ERME online. One is the original version, consisting only of sentences with extended paragraph-size contexts; see ERME metashare. All searches were based on strings, i.e., consecutive keystrokes, but the results were also accompanied by information about the literary work and its author, basically everything you would need for your bibliography. It is important that we can associate example sentences with their authors, because this makes comparison across authors possible in linguistic research.

The latest version of ERME, which is now in test Korp for Erzya and Moksha, has retained the idea of showing important bibliographic data, but automated morphosyntactic annotation has also been added.