Phonological Corpus of Czech




Introduction

The Phonological Corpus of Czech is a bundle of phonologically transcribed databases (sub-corpora) of various Modern Czech word lists and texts. It was developed under the auspices of the project Issues in the Phonology of Word in Czech (13-15361P), supported by the Grant Agency of the Czech Republic (2013-2015), whose goal was to account for various aspects of the phonology of words in Modern Czech. In particular, the project concentrated on the phonotactic aspect of Czech words (phoneme occurrence and phoneme frequency, phoneme combinations and the syllabic structure of words).

Description of the Corpus (last updated: 28/07/2020)

Abbreviations and symbols used in the Corpus and the evaluation files

Lexical Sub-corpus

A phonologically transcribed and annotated database of the Czech vocabulary (i.e. of lemmas / dictionary entries) stored in a csv file (a comma-separated format file editable e.g. in MS Excel). The transcription reflects the phonematic constituency of words and their syllabic structure ("syllabification"). The Corpus also includes an allophonic transcription showing an idealized pronunciation of a given lexical item. The main corpus is supplemented with several smaller lexical corpora.
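
Since the database is distributed as a csv file, it can also be read programmatically rather than in a spreadsheet. The following is only a minimal sketch: the column names (lemma, phonemic, allophonic), the dot used as a syllable boundary and the two sample rows are invented for illustration and are not taken from the Corpus; the actual layout is given in the Description of the Corpus.

```python
import csv
import io

# Hypothetical sample: the column names and the transcription notation
# are assumptions made for this sketch, not the real layout of the Corpus.
SAMPLE = """lemma,phonemic,allophonic
voda,vo.da,voda
strom,strom,strom
"""

def load_entries(handle, delimiter=","):
    """Return the rows of a csv file as a list of dicts keyed by the header."""
    return list(csv.DictReader(handle, delimiter=delimiter))

entries = load_entries(io.StringIO(SAMPLE))
for entry in entries:
    # Print each lemma next to its (hypothetical) syllabified transcription.
    print(entry["lemma"], entry["phonemic"])
```

To read a downloaded file instead of the sample, the same function could be given a file object opened with open(path, newline="", encoding="utf-8"); the encoding is likewise an assumption.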

The Lexical Sub-corpus ver. 1 currently contains:

The lexical items have been taken from the following major dictionaries of Czech included in the Database of Glossaries (except for Výslovnost spisovné češtiny, which was added separately):

Download the Lexical Sub-corpus (zip/csv, ver. 1, last updated: 29/06/2016)

Quantitative analysis of the Lexical Sub-corpus

The Lexical Sub-corpus was quantitatively analyzed to obtain frequencies of various phonological units, in particular the phonological word. See the Description of the Corpus for the explanation of this notion.

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values apply to phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)
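
The kinds of counts behind the frequency tables listed above (single phonemes and two-phoneme combinations) can be illustrated with a short sketch. It is not the procedure used to produce the published tables: the input format (each word as a list of phoneme symbols) and the two sample words are assumptions made for illustration.

```python
from collections import Counter

def phoneme_counts(words):
    """Count how often each phoneme symbol occurs across all words."""
    counts = Counter()
    for word in words:
        counts.update(word)
    return counts

def bigram_counts(words):
    """Count adjacent two-phoneme combinations within each word."""
    counts = Counter()
    for word in words:
        # zip pairs every phoneme with its right-hand neighbour.
        counts.update(zip(word, word[1:]))
    return counts

# Hypothetical input: each word is a list of phoneme symbols, so symbols
# written with more than one character would still count as one unit.
words = [["v", "o", "d", "a"], ["d", "o", "m"]]

single = phoneme_counts(words)
pairs = bigram_counts(words)
print(single.most_common(2))
print(pairs[("d", "o")])
```

Keeping each word as a list of symbols, rather than as a plain string, avoids splitting a phoneme that is transcribed with a digraph.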

Additional lexical sub-corpora

The main Lexical Sub-corpus is supplemented with several lexical databases:

Names of municipalities and their parts (zip/csv, ver. 1.1, last updated: 29/06/2016)
· 15,051 names of Czech municipalities and their parts in existence at the end of 2013; the database was analyzed and described in the paper Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Corpus since the publication of the paper.

· Complete analysis (zip, multiple csv files)

Most common male and female names and their hypocoristic forms (zip/csv, ver. 1, last updated: 29/06/2016)
· 5,724 items

Botanical names (zip/csv, ver. 1, last updated: 29/06/2016)
· 2,549 items

Zoological names (zip/csv, ver. 1, last updated: 29/06/2016)
· 4,517 items

SSČ

Lexemes from Slovník spisovné češtiny, 4th edition, 2005 (zip/csv, last updated: 29/06/2016)

This sample (49,506 items) was analyzed and described in the papers Corpus-based analysis of the Czech syllable (2014), Kvantitativní analýza slabiky v českém lexikonu (2015) and Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí (2015) (see Publications). Some minor corrections have been made in the Database since the publication of the papers.

· Complete analysis (zip, multiple csv files)

Textual Sub-corpus

The Textual Sub-corpus consists of a selection of phonologically transcribed Czech texts stored in xml files. The texts are mostly Czech novels in the public domain (see here for the list of the currently included texts). As with the Lexical Sub-corpus, the transcription reflects the phonematic constituency of words (and sentences) and their syllabic structure. The Sub-corpus also includes an allophonic transcription showing the idealized pronunciation of the sentences. In addition, the transcription takes into account the neutral prosodic organization of words within sentences. This prosodic organization was automatically assigned on the basis of the rules proposed by Zdena Palková for the automatic TTS (text-to-speech) synthesis of Czech. See Description of the Corpus for more details.
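
Because the texts are stored as xml, they can be processed with any standard xml parser. The sketch below only illustrates the idea: the element names (text, s, w), the attribute names (orth, phon), the dot as a syllable boundary and the sample sentence are all hypothetical; the real schema is documented in the Description of the Corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are assumptions
# made for this sketch, not the actual schema of the Textual Sub-corpus.
SAMPLE = """<text>
  <s>
    <w orth="Voda" phon="vo.da"/>
    <w orth="teče" phon="te.če"/>
  </s>
</text>"""

def sentences(xml_string):
    """Return one list of (orthographic, phonological) pairs per sentence."""
    root = ET.fromstring(xml_string)
    return [
        [(w.get("orth"), w.get("phon")) for w in s.iter("w")]
        for s in root.iter("s")
    ]

def syllable_count(phon):
    """Count syllables, assuming a dot marks each syllable boundary."""
    return phon.count(".") + 1

parsed = sentences(SAMPLE)
for sentence in parsed:
    for orth, phon in sentence:
        print(orth, phon, syllable_count(phon))
```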

The Textual Sub-corpus ver. 1 contains:

Download the Textual Sub-corpus (zip/xml, ver. 1, last updated: 29/06/2016)

Quantitative analysis of the Textual Sub-corpus

The Textual Sub-corpus was quantitatively analyzed to obtain frequencies of various phonological units.

· Complete quantitative analysis (zip, multiple csv files)

Selected frequency tables (all values apply to phonological words):

Phonemes and allophones

Non-nuclear combinations ("consonant clusters")

Two-phoneme combinations of various types

Word structure

Phonotagm (syllable)

Publications

The following publications were written under the auspices of the project or rely on data from the Corpus:

2014

· Bičan, Aleš // Word Phonology in Czech // Czech Language News (spring 2014)

· Bičan, Aleš // K pojmu fonologické slovo v češtině // Sophia Slavica, Sborník prací věnovaných PhDr. Žofii Šarapatkové k osmdesátým narozeninám (eds. Vít Boček - Bohumil Vykypěl), Brno: Tribun 2014, pp. 13-23 // download

· Bičan, Aleš // Nuclearity of /r/ and /l/ in Czech // New Insights into Slavic Linguistics (eds. Jacek Witkoś - Sylwester Jaworski), Frankfurt am Main: Peter Lang 2014, pp. 21-33 // download // syllabicity test (referred to in the paper)

2015

· Bičan, Aleš // Distribution of vocalic quantity in Czech // Grazer Linguistische Studien 83, 2015, pp. 133-138 // download // supplementary data (referred to in the paper)

· Bičan, Aleš // Kvantitativní fonotaktická analýza názvů českých obcí a jejich částí // Slovo a slovesnost 76/4, 2015, pp. 243-264 // download // Appendix 1 // Appendix 2 // Appendix 3 // Appendix 4 (see also above the analysis of the corpus)

· Bičanová, Lenka - Bičan, Aleš // Nástin typologie fonologických změn na úrovni slova // Linguistica Brunensia 63/2, 2015, pp. 7-25 // download

· Bičan, Aleš // Kvantitativní analýza slabiky v českém lexikonu // Linguistica Brunensia 63/2, 2015, pp. 87-107 // download

· Bičan, Aleš // Corpus-based analysis of the Czech syllable // Beiträge der Europäischen Slavistischen Linguistik (POLYSLAV) 18 (eds. Gutiérrez Rubio, Enrique et al.), München - Berlin - Washington: Harrassowitz Verlag 2015, pp. 26-36 // download

· Bičan, Aleš // Fonologický lexikální korpus češtiny a slabičná struktura českého slova // Bohemica Olomucensia 7/3-4, 2015, pp. 45-59 // download // supplementary data

2016

· Bičan, Aleš // Phonological properties of Czech verse and prose compared // Beiträge zum 19. Arbeitstreffen der Europäischen Slavistischen Linguistik (POLYSLAV) (eds. Gutiérrez Rubio, Enrique et al.), München - Berlin - Washington: Harrassowitz Verlag 2016, pp. 27-37

2017

· Bičan, Aleš // Fonologický korpus češtiny // Nový encyklopedický slovník češtiny (eds. Karlík, Petr et al.). Online

· Bičan, Aleš // Slabikování // Nový encyklopedický slovník češtiny (eds. Karlík, Petr et al.). Online

· Bičan, Aleš // Fonématika // Nový encyklopedický slovník češtiny (eds. Karlík, Petr et al.). Online

· Bičan, Aleš // Fonotaktika // Nový encyklopedický slovník češtiny (eds. Karlík, Petr et al.). Online

2019

· Bičan, Aleš // Etymologie a fonologie: Případ Mathesiova fonotaktického pravidla // Vesper Slavicus. Sborník k nedožitým devadesátinám prof. Radoslava Večerky (ed. Petr Malčík), Praha: Nakladatelství Lidové noviny 2019, pp. 13-31 // download

2020

· Bičan, Aleš // Syllabic Nasals in Czech // Etymologus (ed. Harald Bichlmeier et al.). Hamburg: Baar, 2020, 59–71. // download // Database of the Czech words with a syllabic nasal

· Bičan, Aleš // The Phonotactics of Syllabic Liquids in Czech Words of Foreign Origin // Zeitschrift für Slawistik 65(2), 2020, pp. 163-193 // download

· Bičan, Aleš // Kombinace konsonantu s vokálem v českých slovech cizího původu // To be published in Slovo a slovesnost // Consonant-Vowel combinations and their frequency [revised 10/9/2019]

Contact

Aleš Bičan
Ústav pro jazyk český AV ČR, v. v. i. // Czech Language Institute, Academy of Sciences of the Czech Republic
Veveří 97
60200 Brno
Czech Republic

email: bican@phil.muni.cz