VoLIP: Voce del LIP

A linguistic resource for the study of variation in the Italian language

 

Miriam Voghera

Francesco Cutugno

Claudio Iacobini

Renata Savy

 

VoLIP (Voce del LIP) is a linguistic resource which matches the audio signal files with the orthographic transcriptions of the samples of the LIP Corpus and allows the search of the corpus according to sociolinguistic as well as lexical and morpho-syntactic criteria.

VoLIP is a project financed by the Italian Ministry of Education and Research and is expected to be released in 2012.

 

The LIP Corpus

The LIP Corpus was collected in the early 1990s to compile a frequency lexicon of spoken Italian (T. De Mauro, F. Mancini, M. Vedovelli, M. Voghera, Lessico di frequenza dell’italiano parlato, Milano, Etaslibri, 1993), and its size was tailored to produce a reliable frequency lexicon for the first 3,000 lemmas. It therefore consists of about 500,000 word tokens from 60 hours of recording.

The corpus represents diaphasic, diatopic and diamesic varieties.

As far as diaphasic variation is concerned, the texts are divided into five groups: A) face-to-face conversations; B) telephone conversations; C) bidirectional communicative exchanges with constrained turn-taking, such as interviews, debates, classroom interactions, oral exams, etc.; D) monologues, such as lectures, sermons, speeches, etc.; E) radio and television programmes. The texts in groups A and B belong to both formal and informal registers, while the texts in groups C, D and E were mainly recorded in public contexts, which call for formal registers.

As far as diatopic variation is concerned, the texts were collected in Milan, Rome, Naples and Florence. The first three cities were chosen for their geographical position as well as their number of inhabitants, as Rome, Naples and Milan are the most populous Italian cities. Florence was chosen because of its great relevance in the linguistic history of the Italian language.

While the number of samples is variable, the corpus presents a balanced total number of words per city and per diaphasic situation, as reported in Table 1.

 


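Table 1. Number of word tokens per city and per diaphasic situation

| City | Face-to-face conversations | Telephone conversations | Interviews, debates, meetings | Monologues | Radio/TV | Total |
|---|---|---|---|---|---|---|
| Milan | 25,000 | 25,000 | 25,000 | 25,000 | 25,000 | 125,000 |
| Florence | 25,000 | 25,000 | 25,000 | 25,000 | 25,000 | 125,000 |
| Rome | 25,000 | 25,000 | 25,000 | 25,000 | 25,000 | 125,000 |
| Naples | 25,000 | 25,000 | 25,000 | 25,000 | 25,000 | 125,000 |
| Total | 100,000 | 100,000 | 100,000 | 100,000 | 100,000 | 500,000 |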
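The balanced design can be checked with a few lines of arithmetic. The sketch below rebuilds the table from the single cell value of 25,000 words and recomputes the row, column and grand totals; the variable names and the short situation labels are illustrative choices, not part of the VoLIP distribution. The last line derives an average word rate from the stated figure of about 60 hours of recording.

```python
# Word tokens per city and per communicative situation, as reported in Table 1.
CITIES = ["Milan", "Florence", "Rome", "Naples"]
SITUATIONS = [
    "face-to-face conversations",
    "telephone conversations",
    "interviews, debates, meetings",
    "monologues",
    "radio/TV",
]
WORDS_PER_CELL = 25_000  # every city/situation cell holds the same number of words

# Rebuild the table as a nested dictionary: city -> situation -> word count.
table = {city: {s: WORDS_PER_CELL for s in SITUATIONS} for city in CITIES}

# Totals per city (rows), per situation (columns) and for the whole corpus.
row_totals = {city: sum(cells.values()) for city, cells in table.items()}
column_totals = {s: sum(table[city][s] for city in CITIES) for s in SITUATIONS}
grand_total = sum(row_totals.values())

# Average number of word tokens per hour, given about 60 hours of recording.
RECORDED_HOURS = 60
words_per_hour = grand_total / RECORDED_HOURS

print(row_totals)
print(column_totals)
print(grand_total, round(words_per_hour))
```

Each city contributes 125,000 words and each situation 100,000, which matches the totals in the table; the average comes to roughly 8,300 word tokens per recorded hour.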

 

Since the corpus was originally collected for lexical purposes, the recording conditions and the acoustic quality of the sessions vary: signal clarity ranges from high to low.

 

The structure of VoLIP

VoLIP provides all the samples of the LIP Corpus as WAV files (Windows PCM, 22,050 Hz, 16 bit), in addition to:

1. session metadata in IMDI format;

2. the original orthographic transcription, already published in De Mauro et al. 1993, in TXT files.
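As an illustration of the audio format described above, the following sketch uses Python's standard `wave` module to check whether a WAV file is sampled at 22,050 Hz with 16-bit samples. The function names are hypothetical, and the number of channels is not stated in this description, so the check ignores it and the synthetic test file is assumed to be mono.

```python
import io
import wave

EXPECTED_RATE = 22_050      # samples per second, as stated for the VoLIP files
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16 bit


def matches_volip_format(source) -> bool:
    """Return True if the WAV at `source` (a path or a binary file object)
    is sampled at 22,050 Hz with 16-bit samples."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )


def make_silent_wav(rate: int, sample_width: int, seconds: int = 1) -> io.BytesIO:
    """Build an in-memory mono WAV file of zero-valued samples, for testing only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (rate * sample_width * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


print(matches_volip_format(make_silent_wav(22_050, 2)))
print(matches_volip_format(make_silent_wav(44_100, 2)))
```

With a real corpus file, `matches_volip_format` can be called with the file path instead of the in-memory buffer.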

 

The queries

Two kinds of queries are possible: A) by textual and register variables, as recorded in the metadata annotation; B) by lexical and morpho-syntactic criteria, as derived from both the frequency lexicon and the part-of-speech parsing. The two kinds of queries can be combined.

The metadata entries are the following: city, actor sex, genre, subgenre, subject, interactivity, planning type, social context, event structure, channel.

All queries return the orthographic transcriptions matched with the corresponding audio files.

1. A metadata search returns all the texts presenting the requested features; metadata queries can be combined with lexical and morpho-syntactic queries.

2. A lexical and morpho-syntactic search returns all the texts containing the requested item (word form or lexeme), together with the specific item within a preceding and subsequent portion of time. Each requested lexeme, word form or part of speech is provided with its frequency of occurrence per city and per register.
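The two kinds of queries and their combination can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the VoLIP implementation: the `Token` and `Text` classes, the function names, the part-of-speech labels and the three sample texts are all invented for illustration, and the audio alignment (the portion of time around each hit) is left out.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word form as transcribed
    lemma: str  # lexeme the form belongs to
    pos: str    # part-of-speech label (hypothetical tag set)


@dataclass
class Text:
    metadata: dict  # e.g. {"city": ..., "genre": ...}
    tokens: list    # list of Token


def metadata_search(texts, **criteria):
    """Return the texts whose metadata match every requested feature."""
    return [t for t in texts
            if all(t.metadata.get(key) == value for key, value in criteria.items())]


def matches(token, form=None, lemma=None, pos=None):
    """True if the token satisfies every criterion that was given."""
    return ((form is None or token.form == form)
            and (lemma is None or token.lemma == lemma)
            and (pos is None or token.pos == pos))


def lexical_search(texts, **item):
    """Return the texts containing at least one token matching the item."""
    return [t for t in texts if any(matches(tok, **item) for tok in t.tokens)]


def frequency_by(texts, field, **item):
    """Count the occurrences of the item, grouped by a metadata field."""
    counts = Counter()
    for t in texts:
        counts[t.metadata[field]] += sum(matches(tok, **item) for tok in t.tokens)
    return counts


# Invented toy data, for illustration only.
corpus = [
    Text({"city": "Milan", "genre": "telephone conversation"},
         [Token("parlo", "parlare", "V"), Token("casa", "casa", "N")]),
    Text({"city": "Naples", "genre": "monologue"},
         [Token("parlano", "parlare", "V"), Token("parlare", "parlare", "V")]),
    Text({"city": "Naples", "genre": "telephone conversation"},
         [Token("casa", "casa", "N")]),
]

# A combined query: texts from Naples that contain the lexeme "parlare".
crossed = lexical_search(metadata_search(corpus, city="Naples"), lemma="parlare")
print(len(crossed), crossed[0].metadata["genre"])

# Frequency of the lexeme "parlare" per city.
print(frequency_by(corpus, "city", lemma="parlare"))
```

Applying the metadata filter first and the lexical filter second gives the same result as the reverse order, since both are plain filters over the same list of texts.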

 

For more information about the composition of the corpus and how to query it, see

Miriam Voghera, Claudio Iacobini, Renata Savy, Francesco Cutugno, Aurelio De Rosa, Iolanda Alfano, 2014. VoLIP: a searchable Italian spoken corpus, in Complex Visibles Out There. Proceedings of the Olomouc Linguistics Colloquium: Language Use and Linguistic Structure. Edited by Ludmila Veselovská and Markéta Janebová. Olomouc: Palacký University, 2014, pp. 628-640.
