MARKUP CONVENTIONS
TURNS A turn is one speaker's uninterrupted stretch of speech, marked with ordered numerals. Each turn also
carries the speaker ID, which links to that speaker's background information in the search engine.
UTTERANCES Utterances are sentence-like segments within a turn and are also marked with ordered numerals.
WORDS Words are one or more syllables/characters which have independent lexical status.
PUNCTUATION The punctuation marks used are: , (comma), 。 (period),
“” (quotation), 《》 (book title), —— (sudden stop or sudden switch of topic).
OVERLAPPING SPEECH Overlapping speech occurs when one speaker begins to speak while another is still
speaking. It is marked with <overlap gid="" oid=""></overlap>, where "gid" is the number of the
overlap group and "oid" is the number of the segment within that group. Transcriptions of overlapping
speech are enclosed in ([]).
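As an illustration of how the gid and oid attributes might be used, the sketch below groups overlap segments by group number. Only the <overlap> element and its two attributes come from the conventions above. The <sample> wrapper, the placeholder text, the assumption that both attributes hold integers, and the assumption that the ([]) brackets sit inside the tag are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment: <overlap>, gid and oid follow the conventions above;
# the <sample> wrapper and the placeholder text are invented for illustration.
OVERLAP_SAMPLE = (
    '<sample>'
    '<overlap gid="1" oid="1">([A B])</overlap>'
    '<overlap gid="1" oid="2">([C])</overlap>'
    '<overlap gid="2" oid="1">([D])</overlap>'
    '</sample>'
)

def group_overlaps(xml_text):
    """Return {gid: [(oid, text), ...]} with segments sorted by oid."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for node in root.iter("overlap"):
        # Assumption: gid and oid are integer-valued.
        gid = int(node.get("gid"))
        oid = int(node.get("oid"))
        text = (node.text or "").strip()
        # Drop the ([ ... ]) wrapper if it is present inside the tag.
        if text.startswith("([") and text.endswith("])"):
            text = text[2:-2]
        groups[gid].append((oid, text))
    return {gid: sorted(segs) for gid, segs in groups.items()}

overlaps = group_overlaps(OVERLAP_SAMPLE)
```

Here the two segments with gid 1 end up in the same group, ordered by oid, which is one plausible way to reconstruct who overlapped with whom.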
NON-SPEECH Events other than speech include laughing, crying, shouting, advertisements in News
Commentary, and so on. Their descriptions are enclosed in () and marked with <desc></desc>.
NON-WENZHOU LANGUAGES (SWITCHED CODES) The corpus sometimes contains stretches of speech that are not
in Wenzhou, e.g., English or Japanese. These stretches are enclosed in {{}} and marked with
<mixed></mixed>.
UNCLEAR ELEMENTS Elements that are not heard clearly enough to be transcribed are marked with
<unclear></unclear>.
WRITTEN TRANSCRIPTION Where a character for a Wenzhou spoken form is available in simplified Unicode
Chinese, that character is used. Characters or words that are not available in simplified Unicode
Chinese are transcribed in phonetic fonts in [[]] and marked with <phonetic></phonetic>.
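The inline tags described above (<desc>, <mixed>, <unclear>, <phonetic>) can be removed to obtain an untagged transcription, or collected to see what each one marks. The following is a minimal sketch using regular expressions. The tag names come from the conventions, while the sample line, its placeholder content, and the function names are invented for illustration and do not reflect actual corpus data.

```python
import re

# Tag names are taken from the conventions above.
INLINE_TAGS = ("desc", "mixed", "unclear", "phonetic")
TAG_RE = re.compile(r"</?(?:%s)>" % "|".join(INLINE_TAGS))
SPAN_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(INLINE_TAGS), re.S)

# Hypothetical line: the placeholder text is not real corpus content.
INLINE_SAMPLE = (
    "<desc>(laughing)</desc>AAA<mixed>{{OK}}</mixed>"
    "BBB<unclear></unclear><phonetic>[[ka]]</phonetic>"
)

def strip_tags(line):
    """Remove the inline tags but keep the bracketed content they enclose."""
    return TAG_RE.sub("", line)

def tagged_spans(line):
    """Return (tag, content) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in SPAN_RE.finditer(line)]

plain = strip_tags(INLINE_SAMPLE)
spans = tagged_spans(INLINE_SAMPLE)
```

The bracket conventions ((), {{}}, [[]]) survive tag removal, so the untagged version still shows where non-speech events, switched codes, and phonetic transcriptions occur.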
Examples of transcription and XML markup with and without tags
|