LEAFTOP
======

*Language Extracted Automatically from Thousands of Passages*

A dataset prepared by Gregory Baker (gregory.baker2@hdr.mq.edu.au)

This is an automatically generated dataset, derived from translations
of the four gospels of the New Testament.

The software used for this extraction is available on GitHub at
https://github.com/solresol/thousand-language-morphology

The scraping code used to fetch the translations is included in the
git repository, but it takes several weeks to complete. Likewise, the
code used to identify the most probable vocabulary took over a month
to run on a large multi-CPU machine. The software should run on any
Linux or Unix-like machine (it seems to run successfully on macOS); it
might work on Windows, but this has never been tried.

----------------------------------------------------------------------

There are 416 nouns in the gospels that appear twice or more in the
same case, number and gender.

The assumption is made that each of these nouns is translated
consistently into a single word, single token or character sequence.
(This of course is not universally true, but it works remarkably
often.)

For each single word (or token, or character sequence) in the
translated language, the probability that it is the translation of the
Greek lemma is calculated using the binomial test, based on how often
the word and the lemma appear in corresponding verses.

If no word (or token) is clearly identified as the most probable
translation, the Greek lemma is ignored for that language; if one word
(or token) is the most probable, it is recorded in this dataset. Since
there is some ambiguity, this process produces around 300 nouns per
language, each with a confidence score.

For languages that have multiple translations of the Bible, a "most
common translation choice" is included as well.

This dataset contains the output of this process across 1471 languages.
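
As a rough illustration of the binomial-test step described above, the sketch below ranks candidate tokens for a single Greek lemma. It is not the code from the repository. The function names (`binomial_tail`, `rank_candidates`), the input layout, the choice of each token's overall verse frequency as the null probability, and the one-sided form of the test are all assumptions made for this example.

```python
import math
from collections import Counter


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p) (one-sided binomial test)."""
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1)
    )


def rank_candidates(verses, lemma_verse_ids):
    """Rank tokens as candidate translations of one Greek lemma.

    verses: dict mapping a verse id to its list of tokens in the
        translated language (hypothetical layout for this sketch).
    lemma_verse_ids: set of verse ids in which the Greek lemma occurs.

    Returns (token, p_value) pairs, smallest p-value first.
    """
    total = len(verses)
    n = len(lemma_verse_ids)
    overall = Counter()   # number of verses containing each token
    in_lemma = Counter()  # ... restricted to verses with the lemma
    for verse_id, tokens in verses.items():
        for token in set(tokens):
            overall[token] += 1
            if verse_id in lemma_verse_ids:
                in_lemma[token] += 1

    ranked = []
    for token, k in in_lemma.items():
        # Assumed null hypothesis: the token appears in a lemma verse
        # no more often than it appears in verses generally.
        background_rate = overall[token] / total
        ranked.append((token, binomial_tail(k, n, background_rate)))
    ranked.sort(key=lambda pair: (pair[1], pair[0]))
    return ranked


if __name__ == "__main__":
    toy_verses = {
        1: "the king spoke".split(),
        2: "the king left".split(),
        3: "the boat sank".split(),
        4: "the sea rose".split(),
        5: "a king came".split(),
        6: "the man ate".split(),
    }
    for token, p_value in rank_candidates(toy_verses, {1, 2, 5}):
        print(f"{token}\t{p_value:.3f}")
```

On this toy input, `king` ranks first with a p-value of 0.125, while the very frequent `the` ranks last. The sketch stops at ranking: how the real pipeline decides that a top candidate is "clearly" ahead, and how it turns the test result into a confidence score, is not specified here and is left out.
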
Some metadata about each language (to disambiguate languages from
different regions with the same name) is also included, mostly derived
from Wikidata. Cross-references to the Bible translation from which
each entry is derived are included as well.

This is useful to researchers looking at grammatical morphology across
a large number of languages, and to researchers looking for comparative
wordlists between related languages.

----------------------------------------------------------------------

There are various limitations to be aware of:

- Languages (such as Khmer) that have an alphabet but do not have word
  breaks are not handled very well. Their translations will only be
  correct for words that are 4 letters or shorter; for a Khmer word
  longer than this, the process will output some subsequence of it.

- There are concepts in Koine Greek that do not always translate into
  a single word or single character. "Chief priest" is one; this will
  often be translated incorrectly.

- There will be mistakes in general. An entry with a low confidence
  score (2.0 or lower) has a very high chance of being an incorrect
  translation. The author has begun the process of evaluating how
  accurate this data is in general across a variety of languages.

- The author believes that the use and distribution of these
  translations is legitimate fair dealing and fair use; but the
  underlying translations from which they were derived are usually
  works that are still under copyright, often held by Wycliffe.