Wenzhou Spoken Corpus

温州口语语料库

Department of Linguistics, University of Alberta

Jingxia Lin and John Newman

MARKUP CONVENTIONS

TURNS

A turn is one speaker's uninterrupted stretch of speech. Turns are numbered sequentially. Each turn also carries the speaker ID, which links to the background information for that speaker in the search engine.

UTTERANCES

Utterances are sentence-like segments within a turn; they are also numbered sequentially.

WORDS

A word consists of one or more syllables/characters that together have independent lexical status.

PUNCTUATION

Punctuation marks used are: , (comma), 。 (period), “” (quotation marks), 《》 (book title), and —— (sudden stop or sudden switch of topic).

OVERLAPPING SPEECH

Overlapping speech occurs when one speaker begins to speak while another is still speaking. It is marked with <overlap gid="" oid=""></overlap>, where “gid” is the number of the overlap group and “oid” is the number of the segment within a single overlap. Transcriptions of overlapping speech are enclosed in ([]).
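As an illustration of how the gid and oid attributes might be used, the following minimal Python sketch groups overlap segments by their group number. It assumes that attribute values are plain integers written in straight double quotes and that overlap elements are not nested; the sample string, the speaker labels in it, and the names OVERLAP_RE and group_overlaps are hypothetical and not part of the corpus.

```python
import re
from collections import defaultdict

# Matches one <overlap> element and captures gid, oid, and the enclosed text.
# Assumption: numeric attribute values, straight quotes, no nesting.
OVERLAP_RE = re.compile(
    r'<overlap gid="(\d+)" oid="(\d+)">(.*?)</overlap>', re.DOTALL
)

def group_overlaps(text):
    """Return {gid: [(oid, segment_text), ...]} with segments sorted by oid."""
    groups = defaultdict(list)
    for gid, oid, segment in OVERLAP_RE.findall(text):
        groups[int(gid)].append((int(oid), segment))
    return {gid: sorted(segments) for gid, segments in groups.items()}

# Hypothetical sample, not taken from the corpus.
sample = (
    'A: <overlap gid="1" oid="1">你好</overlap> '
    'B: <overlap gid="1" oid="2">好</overlap> '
    'A: <overlap gid="2" oid="1">是</overlap>'
)
overlaps = group_overlaps(sample)
```

Segments that share a gid belong to the same stretch of simultaneous speech, so grouping on gid and ordering on oid recovers each overlap as a unit.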

NON-SPEECH

Non-speech events include laughing, crying, shouting, advertisements in News Commentary, and so on. Descriptions of these events are enclosed in () and marked with <desc></desc>.

NON-WENZHOU LANGUAGES (SWITCHED CODES)

The corpus sometimes contains stretches of speech in languages other than Wenzhou, e.g., English or Japanese. These stretches are enclosed in {{}} and marked with <mixed></mixed>.

UNCLEAR ELEMENTS

Elements that are not heard clearly enough to be transcribed are marked with <unclear></unclear>.

WRITTEN TRANSCRIPTION

Where a character for a Wenzhou spoken form is available in simplified Chinese in Unicode, that character is used. Characters or words that are not available in simplified Chinese in Unicode are transcribed in a phonetic font, enclosed in [[]], and marked with <phonetic></phonetic>.
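The inline tags described above (desc, mixed, unclear, phonetic) can be pulled out of a tagged transcription with a short script. The Python sketch below is only one possible approach. It assumes that the brackets sit inside the tags, that these elements are not nested, and that they take no attributes; the sample string and the names TAG_RE, extract_spans, and strip_tags are hypothetical.

```python
import re

# Matches any one of the four inline elements and captures its name and content.
# Assumption: no attributes and no nesting of these elements.
TAG_RE = re.compile(r"<(desc|mixed|unclear|phonetic)>(.*?)</\1>", re.DOTALL)

def extract_spans(text):
    """Return a list of (tag_name, content) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in TAG_RE.finditer(text)]

def strip_tags(text):
    """Remove the four inline tags but keep the text they enclose."""
    return TAG_RE.sub(lambda m: m.group(2), text)

# Hypothetical sample, not taken from the corpus.
sample = (
    "你好<desc>(laughing)</desc>"
    "<mixed>{{OK}}</mixed>"
    "<phonetic>[[ŋa]]</phonetic>"
)
spans = extract_spans(sample)
plain = strip_tags(sample)
```

Here extract_spans is useful for counting, say, how often code switching occurs, while strip_tags yields a version of the text without the tags in which the bracket conventions alone signal each element type.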

Examples of transcription and XML markup with and without tags
