Standard Yiddish, the language of classical literature, has been described more or less systematically, but it is only a small part of all Yiddish varieties. The question is how to represent the other varieties in the corpus.
First of all, spoken Yiddish differs from Standard Yiddish phonetically and grammatically, so it needs special treatment in the corpus before it can be analyzed by the parser. The simplest solution is to present dialectal texts in normalized form, paired with recordings; ideally, however, a transcription level should be added as well.
Besides that, there are secondary varieties of Yiddish. Haredi Yiddish is the most widespread variety in the modern world, and it differs from Standard Yiddish grammatically as well as lexically. It has no spelling norms, and the common orthography lacks diacritics, which produces a great deal of homonymy. Developing a Haredi subcorpus is impossible without a transliterator based on machine learning methods.
Yiddish also functions as a heritage language, since many speakers are less proficient in Yiddish than in their primary language of communication (i.e. English or Hebrew). We believe that this subject requires further investigation and that a comprehensive quantitative tool for such research, a heritage (sub-)corpus, would therefore be of great use. One of the difficult problems to be dealt with is the marking of mistakes (or “non-standard language features”) in L2/heritage data.
The paper will also address the principles for selecting texts that represent the different varieties of Yiddish.