1. How to format a transcription?

A transcription in the CHAT format (MacWhinney, 2000), as described below, is required for CALF to generate accurate results for your research. This page provides the general guidelines.

Before applying any of the formatting procedures below, transcribe the speech into plain text. You can then format your transcription into a file (hereafter the “text file”) that CALF can process.

1.1. Headers

A text file begins with a few lines of headers. These headers let you record crucial information about the participants, or any other important information about the research, for future reference.

See below for an example of the headers. Format them exactly as shown; otherwise CALF may accidentally include this section as part of your data. You may add or delete headers as needed, but follow these general formatting rules:

  1. Every line starts with an “@”
  2. Use a Tab after every header name (even when no content follows it)
  3. After @Time Start (or whichever header line comes last), leave two lines blank (without a Tab)

Sample Headers

@UTF8   <tab>

@Begin   <tab>

@Languages:  en

@Participants: STU Chan_Tai_Man Student

@ID:       en|STU|21;|female|relive|Student|

@Date: <tab>

@Location: <tab>

@Coder: <tab>

@Comment:    <tab>

@Time Start:   00:00:00

<blank>

<blank>

After these lines, the rest of the file is the actual data, which is represented in four-tier blocks, as explained and demonstrated in Section 1.2.
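The header rules in Section 1.1 can be checked mechanically before a file is passed on. The following is a minimal Python sketch, not part of CALF or CLAN; the function name check_headers is hypothetical, and the sketch assumes the header lines have already been read into a list of strings. It only tests the two rules stated above (leading “@” and a Tab on every header line).

```python
def check_headers(lines):
    """Return (line_number, message) pairs for header lines that break the rules."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("@"):
            problems.append((number, "header does not start with '@'"))
        elif "\t" not in line:
            problems.append((number, "header has no tab after its name"))
    return problems


# Illustrative header lines; the last two are deliberately wrong.
sample = [
    "@UTF8\t",
    "@Begin\t",
    "@Languages:\ten",
    "@Participants:\tSTU Chan_Tai_Man Student",
    "Date:\t",                # missing '@'
    "@Time Start: 00:00:00",  # a space was typed instead of a Tab
]

for number, message in check_headers(sample):
    print(f"line {number}: {message}")
```

An empty result means that every header line passed both checks; it does not confirm that the header names themselves are ones CALF expects.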

 

1.2 Lines and Tiers

  • In order for CALF to read the file properly, stick to this four-tier block:
*ID:<tab>
%mor:<tab>
%snd:<tab>
%ID:<tab>
  • Note: Of these four lines, only three are coded manually:
    • *ID: The pruned line. The utterance transcribed into words, without dysfluencies, pauses, or grammatical marking.
    • %snd: The duration line. It indicates the start and the end of the utterance in that AS-unit. (Refer to Section 2.2)
    • %ID: The main working line. It includes all dysfluencies and pauses in the recorded speech, as well as syntactic marking. (For the coding format of this line, refer to Sections 2, 3, and 4)

The second line, %mor, is generated by CLAN and gives a part-of-speech (POS) tag for each word in the transcription. Section 1.2.2 explains how to generate this line.
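To make the four-tier layout concrete, here is a minimal Python sketch that checks one block for the expected tier order. It is an illustration only: check_block is a hypothetical helper, not a CALF or CLAN function, and it assumes that “ID” in the tier names stands for the speaker code (for example STU) and that each tier name is followed directly by a Tab.

```python
def check_block(block, speaker):
    """Check that a block holds the four tiers in order, each followed by a tab."""
    expected = [f"*{speaker}:", "%mor:", "%snd:", f"%{speaker}:"]
    problems = []
    if len(block) != len(expected):
        problems.append(f"expected {len(expected)} tiers, found {len(block)}")
    for line, prefix in zip(block, expected):
        if not line.startswith(prefix + "\t"):
            problems.append(f"expected a line starting with {prefix!r} and a tab")
    return problems


# One block, with the %mor tier still empty (CLAN fills it in later).
block = [
    "*STU:\tthe man ask him to fall . |",
    "%mor:\t",
    "%snd:\t<02:00:81><02:03:46>",
    "%STU:\ter now {the man} * (1.19) the man ask him to * (0.67) to fall . |",
]

print(check_block(block, "STU"))      # a well-formed block yields an empty list
print(check_block(block[:1] + block[2:], "STU"))  # the %mor tier is missing
```

The check is deliberately strict about order, since the block is described above as a fixed sequence of four tiers.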

 

1.2.1 Coding the Three Tier Block

Each AS-unit has its own block. On the rare occasions where a line must be left blank, write the tier header followed by a period to keep the format consistent. For example:

%PSK:        .

I suggest this because, if nothing follows the header, CLAN may change the speaker ID of this tier when the file goes through the CHECK command.

  • Use a vertical bar “|” at the end of an AS-unit. Use “:;:” to indicate sub-clauses, and “:::” to indicate main clauses
  • Coordinated verb phrases are considered separate AS-units unless they (a) are within the same tone unit or (b) are separated by a pause of less than 0.5 sec (Foster, Tonkyn & Wigglesworth, 2000):

*STU:      the man ask him to fall . |
(%mor:    this line is to be generated by CLAN later, so leave it blank at the beginning)
%snd: <02:00:81><02:03:46>
%STU:    er now {the man} * (1.19) the man ask him to * (0.67) to fall . |

The same does not apply to coordinated verbs, because verbs with no other clause element are not considered clauses:

%STU:    she’s very happy and jumping and leaping. |

  • “Loose” sub-clauses and non-clausal supplements in final position are sometimes considered a new AS-unit:

%STU:    and the guy is now on the floor, struggling. (falling tone) |
%STU:    (1.24) as he is stuck.|

%STU:     er (0.57) and Mister_bean doesn’t realise this. |
%STU:     until now. |

  • Repaired and unrepaired false starts. Consider this:
    so now uh ok now the mister oh the guy tell Mister bean to go far go further to take a good good pose position.

Situation 1: In the pause task, the speaker watches the video, tries to start an utterance, hesitates, pauses the video, and then finishes the utterance. This is a repaired false start. Code it with reformulation (refer to Section 2):

 

%STU:    {so now uh} rpl {ok now the mister}rpl oh the guy tell Mister_bean ::: to {go far} ~ go further :;:m to take a good * good pose rpl position. |

 

Situation 2: There is a long pause after “uh”. In the watch-and-tell task, much could happen during the 4.33 sec of silence, and the speaker fails to utter whatever she wanted to say after “so now”. The fragment is coded as one abandoned unit:

%STU:    so now uh #. |
%STU:    (4.33) {ok now the mister} rpl oh the guy tell Mister_bean ::: to {go far} ~ go further :;:m to take a good * good pose rpl position. |

  • Resuming a previous thread:

%STU:    and {when he} * when he saw the ball is ~ uh was under the chair, and # (1.16) ~ chairs. |

%STU:    he put the ball down again, {and he’s} # getting ready to play.

%STU:    she sits down on bed.
%STU:    {she’s very} # pouting.
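Several of the markers used in the working line are regular enough to be picked out with simple string operations. The sketch below, in Python, is only an illustration under stated assumptions: the helper names are hypothetical, pauses are assumed to be written as a number in parentheses such as (1.19), and the clause markers are counted as plain substrings. It does not attempt to implement the full coding scheme of Sections 2 to 4.

```python
import re

# A pause is assumed to be a number in parentheses, e.g. "(1.19)" or "(4)".
PAUSE = re.compile(r"\((\d+(?:\.\d+)?)\)")


def pauses(working_line):
    """Return the pause lengths (in seconds) found in a working line."""
    return [float(value) for value in PAUSE.findall(working_line)]


def marker_counts(working_line):
    """Count the ':::' and ':;:' clause markers in a working line."""
    return {":::": working_line.count(":::"), ":;:": working_line.count(":;:")}


def ends_with_bar(working_line):
    """True if the line closes its AS-unit with a vertical bar."""
    return working_line.rstrip().endswith("|")


line = ("%STU:\t(4.33) {ok now the mister} rpl oh the guy tell Mister_bean ::: "
        "to {go far} ~ go further :;:m to take a good * good pose rpl position. |")

print(pauses(line))
print(marker_counts(line))
print(ends_with_bar(line))
```

Non-numeric notes in parentheses, such as (falling tone), are ignored by the pause pattern. A quick scan like this can help spot a missing “|” or an unexpectedly long pause before the file is checked by hand.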

 

1.2.2 Producing the “%mor” (CLAN)

CLAN can help you produce the second line, “%mor”; the other three lines you need to produce yourself. Producing the %mor line involves two CLAN commands, “mor” and “post”.

The “mor” command automatically produces the part-of-speech tagging.

To use the “mor” command in CLAN, follow these steps:

  • Produce a text file containing the *ID, %snd and %ID lines
  • Open CLAN and choose your target file in the “working” section
  • Choose the location (“output”) to put your finished file
  • Choose “mor” in lib
  • Choose “ENG” in mor lib
  • Choose “mor” in the CLAN yellow button
  • Press the “File In” button, and then find your text file
  • Once you have selected your file, press Done; in the main menu it will become “mor@”
  • Then press Run

 

The “mor” line produced so far may contain more than one possibility for a certain word. You will need the “post” command to disambiguate the tagging.

Use the name.mor file obtained from the last step to perform the “post” function.

  • Choose your target file in the “working” section in CLAN
  • Choose the location (“output”) to put your finished file
  • Choose “post” in lib
  • Choose “post-wright” in mor lib
  • Choose “post” in the CLAN yellow button
  • Press the “File In” button, and then find your text file
  • Once you have selected your file, press Done; in the main menu it will become “post@”
  • Then press Run

 

1.2.3 Checking the  “%mor” line

The %mor line is largely generated by CLAN (as demonstrated in the previous section). However, because CLAN does not tag with 100% accuracy, this section gives simple guidelines for checking and correcting the tags manually.

Follow these rules during the manual tagging:

  • Make sure every code starts with a lower-case letter
  • CALF does not count auxiliary and copula verbs in lexical formality
  • If a code needs further specification, use “:” as a separator (e.g. pro:poss:det)

When checking the accuracy of CLAN-generated tagging (i.e. the %mor line), look for the following:

  • Wrongly tagged items, e.g. pro|he n|works prep|in det|a n|university .
  • Question marks, e.g. ?|school → n|school
  • “none”, e.g. none|idea → n|idea

Here is a list of POS tagging codes and their explanations. CALF recognizes only these codes, so stick to the codes specified here.

Code            Explanation
det             Determiner
n               Noun
adj             Adjective
v               Verb
adv             Adverb
aux             Auxiliary verb
cop             Copula verb
n:gerund        Gerund
pro             Pronoun
pro:obj         Object
pro:poss:det    Possession “’s”
inf             Infinitives
prep            Preposition
rel             Relative clause
wh              Wh- words, e.g. when
inter           Interjection, e.g. Oh
conj            Conjunction, e.g. but, and
num             Numbers, e.g. 1994, 16
subor           Subordinators, e.g. that, if
neg             Negation, e.g. not
part            Particles
do&3s           does
do&PAST         did
do&PASTP        done
do&PRES         do
do&PRESP        doing
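Part of the manual check can be automated by flagging every tag whose code is not in the list above, which also catches the “?” and “none” cases. The following Python sketch is an assumption-laden illustration, not CALF's own logic: flag_mor_tokens is a hypothetical helper, and it assumes each tagged item on the %mor line has the form code|word, separated by spaces, with the code matched exactly against the list.

```python
# Codes copied from the list above.
ALLOWED_CODES = {
    "det", "n", "adj", "v", "adv", "aux", "cop", "n:gerund", "pro", "pro:obj",
    "pro:poss:det", "inf", "prep", "rel", "wh", "inter", "conj", "num", "subor",
    "neg", "part", "do&3s", "do&PAST", "do&PASTP", "do&PRES", "do&PRESP",
}


def flag_mor_tokens(mor_line):
    """Return the code|word items whose code is not in ALLOWED_CODES."""
    content = mor_line.split("\t", 1)[-1]  # drop the "%mor:" tier name if present
    flagged = []
    for token in content.split():
        if token == "|" or "|" not in token:
            continue  # punctuation or an AS-unit boundary, not a tagged word
        code = token.split("|", 1)[0]
        if code not in ALLOWED_CODES:
            flagged.append(token)
    return flagged


mor_line = "%mor:\tpro|he n|works prep|in det|a ?|school none|idea ."
print(flag_mor_tokens(mor_line))
```

Note that n|works is not flagged: a tag that uses a valid code but is wrong for the word can only be found by reading the line, so this kind of scan supplements manual checking rather than replacing it.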

 

*If lexical formality and lexical density are not your concerns, you can leave the %mor line unchecked after the post routine. Otherwise, manual checking is necessary for accuracy.

For a template of a standard coded file, please download here.

The template includes the headers and regular blocks, which can be expanded as needed. Simply fill in your data block by block, then move on to producing the %mor line (Section 1.2.2). Always save as plain text before moving on to the next step.

CALF may not be able to function properly if the text file is not in the correct format. Check the following in your text file:

  • Use a Tab after every header and tier name
  • After the @Time Start header, leave two lines blank
  • At the end of the text, leave one line blank (without a Tab), and write “@End” on the next line
  • In the body of the text, every line needs a Tab after its tier name
  • Watch out for spaces after the full stop: there must be no trailing space at the end of any of the four lines.
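The checklist above lends itself to a quick automated pass before a file is submitted. The Python sketch below is a rough illustration under stated assumptions, not a replacement for CLAN's CHECK command: check_file is a hypothetical helper, tier lines are assumed to be the ones starting with “*” or “%”, and the two blank lines are assumed to sit directly before the first “*” line.

```python
def check_file(text):
    """Return a list of formatting problems found in the text of a coded file."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string left by a final newline
    problems = []

    # One empty line, then "@End", must close the file.
    if len(lines) < 2 or lines[-1] != "@End" or lines[-2] != "":
        problems.append("file must end with an empty line followed by '@End'")

    # Two empty lines must separate the headers from the first block.
    first_tier = next((i for i, l in enumerate(lines) if l.startswith("*")), None)
    if first_tier is not None and (
        first_tier < 2 or lines[first_tier - 1] != "" or lines[first_tier - 2] != ""
    ):
        problems.append("leave two empty lines between the headers and the first block")

    # Every tier needs a tab and must not end with a space.
    for number, line in enumerate(lines, start=1):
        if line.startswith(("*", "%")):
            if "\t" not in line:
                problems.append(f"line {number}: tier has no tab")
            if line.endswith(" "):
                problems.append(f"line {number}: trailing space")
    return problems


good = (
    "@UTF8\t\n@Begin\t\n@Time Start:\t00:00:00\n\n\n"
    "*STU:\tthe man ask him to fall . |\n%mor:\t\n"
    "%snd:\t<02:00:81><02:03:46>\n%STU:\ter now the man ask him to fall . |\n"
    "\n@End\n"
)
print(check_file(good))
```

To run it on a real file, read the file as UTF-8 text and pass the contents to check_file; an empty list means none of these particular checks failed.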

References

Foster, P., Tonkyn, A., & Wigglesworth, G. (2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 21(3), 354-375.

MacWhinney, B. (2000). The CHILDES project: Tools for analyzing talk (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.