Corpus construction

  • Manuel BARBERA (Turin, Italie)
    Complex lexical units and their morphosyntactic treatment in the Corpus Taurinense
    2000, Vol. V-2, pp. 57-70

    Corpus Taurinense (CT) is the POS-tagged version of the ItalAnt Corpus, an electronic corpus of Old Italian texts (written between 1251 and 1300). In this article we describe the approach followed in CT for the annotation of multiword units (MWUs). In our work, an MWU is a set of two or more graphic words that (also) receives an overall POS tag because the set stands in a paradigmatic relation with a one-word lexical unit with the same POS. Our POS tagging confirms that most Modern Italian compound conjunctions were not yet lexicalised at that time. The order of the components is already the Modern Italian order, but they can still be interrupted by occasional elements.

  • Rabia BELRHALI (INPG-Grenoble)
    BdPholex: a phonetical and lexical database of spoken French
    1999, Vol. IV-1, pp. 75-78
  • Gabriel BERGOUNIOUX (Orléans)
    The Sociolinguistic study on Orleans (1966-1970)
    1996, Vol. I-2, pp. 87-88
  • Mireille BILGER (Perpignan)
    Corpus de portugais et d'espagnol
    1996, Vol. I-2, pp. 124-130
  • Christian BOITET (Grenoble 1)
    Corpus for the Machine Translation: types, sizes and connected problems, in relation to use and system type
    2007, Vol. XII-1, pp. 25-38

    It is important to realise that human translation is difficult and diverse, and that automation is needed not only by end users, but also by translators and interpreters. Automation itself also comes in many forms. After briefly describing computer tools for translators, we concentrate on the linguistic and computational approaches to the automation of translation proper. This survey yields an array of criteria for categorizing existing CAT systems, with brief examples of the state of the art. Finally, we present perspectives for future research, development, and dissemination.

  • Louis-Jean BOË (Grenoble)
    The material aspect of sound structures in language
    1996, Vol. I-1, pp. 41-54

    Do the major tendencies of the phonological systems of languages depend on constraints of production and perception? This problem has been studied in the framework of "substance-oriented" linguistics, which was introduced simultaneously by Lindblom and Stevens in 1972. Various universal tendencies of phonological systems that might be explained by the characteristics of sound structures, and that could be looked upon from an ontogenetic perspective, are presented and discussed here. The characteristics and the predictability of vocalic and syllabic systems seem eminently suited to the study of this question on the basis of research carried out at the ICP.

  • Veerle BROSENS (Louvain, Belgique)
    The ELILAP and LANCOM projects
    1999, Vol. IV-1, pp. 89-95
  • Henri BÉJOINT (Lyon 2)
    Computer science and corpus lexicography: the new dictionaries
    2007, Vol. XII-1, pp. 7-23

    The dictionary evolved from medieval glosses that explained fragments of discourse in their contexts. Those fragments were later collected, then classified and reduced to their simplest forms, i.e. words. The most important aspect of that evolution from the gloss to the dictionary is that the fragment to be explained was decontextualized, extracted from discourse. The main objective of the dictionary is to give an image of the system. It is now possible to improve the dictionary in its role as a tool for explaining discourse. It cannot provide explanations adapted to every single context, but it can give the user a huge quantity of discourse, and provide explanations more closely adapted to every occurrence or type of occurrence. Lexicographers would be well advised to investigate those new possibilities.

  • Nicoletta CALZOLARI (CNR-Pise, Italie)
    Standards for Linguistic Resources in Europe: the LE-EAGLES Project
    1999, Vol. IV-1, pp. 57-64

    The rapid growth of digitized linguistic information has raised the problem of its standardisation with a view to broader and better use, while at the same time a need was felt to test the various tools developed for this purpose. On the initiative of the European Commission, these questions have led to several research projects aiming to propose useful standards for the whole of Europe, among them the EAGLES Project presented here.

  • Bart DEFRANCQ (Gand, Belgique)
    Corpus research at Ghent
    1996, Vol. I-2, pp. 93-94
  • Norbert DITTMAR (Berlin, Allemagne)
    Corpora of spoken and written German. Documentation on technical and organizational data
    1996, Vol. I-2, pp. 135-139
  • Marie-Laure ELALOUF (Cergy-Pontoise)
    The building-up and exploitation of corpora of texts written in schools
    2007, Vol. XII-1, pp. 53-70

    The first part of this article explains which methodological issues need to be examined in order to establish and transcribe a large corpus of texts written by pupils, along with their school context. The second part sets out the various lines of epistemological questioning which led to a second research project, i.e. questions about how to define types of school writing as well as a corpus and a context, and about the necessary links between those three elements. A variety of software programs was used to analyse corpora which did not conform to orthographic and stylistic standards. Such use seems feasible when combined with qualitative analysis.

  • Gunnel ENGWALL (Stockholm, Suède)
    French corpora made in Sweden
    1996, Vol. I-2, pp. 89-90
  • Michel FRANCARD (Louvain-la-Neuve, Belgique)
    The VALIBEL database
    1996, Vol. I-2, pp. 91-92
  • Benoît HABERT (Paris X-Nanterre)
    To tool up linguistics: from borrowing techniques to the meeting of knowledge
    2004, Vol. IX-1, pp. 5-24

    As such, linguistic research does not imply specific devices. However, linguistic descriptions and models would benefit from relying more often on NLP (Natural Language Processing) tools and resources and on computer science methods. The possible outcome depends on the chosen type of interaction between NLP, computer science and linguistics. A synergy between paradigms and methodologies would be more fruitful than a mere import of techniques.

  • Marie-Christine HAZAËL-MASSIEUX (Aix-en-Provence)
    Creole corpora
    1996, Vol. I-2, pp. 103-110
  • Stig JOHANSSON (Oslo, Norvège)
    Corpora for English language research
    1996, Vol. I-2, pp. 116-123
  • J.G. KRUYT (Leyde, Pays-Bas)
    Towards the Integrated Language Database of 8th-21st Century Dutch
    2000, Vol. V-2, pp. 33-44

    In the past decade, technology has had a major impact on the activities of the Institute for Dutch Lexicology (INL). The results include three electronic dictionaries, covering the period from 1200 up to 1976, and some linguistically annotated text corpora of historical and present-day Dutch. Three present-day corpora have been widely used not only for lexicography but also for many other purposes, since becoming accessible over the Internet in 1994. Advanced technology will have even more importance for a project recently started, the Integrated Language Database of 8th-21st Century Dutch, in which the dictionaries, lexica and a diachronic text corpus will be linked in a meaningful way. Parts of the database will be linked with comparable data collections at other institutes, thus creating a supra-institutional research instrument which will provide new opportunities for innovative research.

  • Jon LANDABURU (CNRS-Célia)
    Building a linguistic database for the Indo-american languages of Colombia: maps, glossaries, sound archives
    1997, Vol. II-1, pp. 83-90
  • Ann LAWSON (IDS-Mannheim, Allemagne)
    Corpus Linguistics at the Institut für deutsche Sprache
    1999, Vol. IV-1, pp. 79-82
  • Thomas Hun-tak LEE (Hong-Kong)
    CANCORP - The Hong Kong Cantonese Child Language Corpus
    1999, Vol. IV-1, pp. 21-30

    This article presents CANCORP (the Hong Kong Cantonese Child Language Corpus), a corpus built in the spirit of the Child Language Data Exchange System (CHILDES, MacWhinney & Snow, 1985). After a brief description of the contents of CANCORP, the technical problems related to transcribing the recordings of children in Chinese and in romanized characters are addressed. Next, a short assessment is made of the possibilities that CANCORP offers for the study of language development.

  • Isabelle LEROY-TURCAN (Lyon 3)
    The ACADEMIE base and its hypertext: the eight editions of the Dictionnaire de l'Académie française (1694-1935) and the specifics of each edition
    1999, Vol. IV-1, pp. 47-54

    The ACADEMIE project aims to build an electronic database of the eight editions of the Dictionnaire de l'Académie française (DAF). As these eight editions cover the period from 1694 to 1932-35, this corpus presents interesting problems of diachrony and synchrony, and also touches on issues related to literature and culture. The DAF database is accordingly enriched by a whole range of hypertext links, allowing a dynamic dialogue between specialists and readers/consultants.

  • Eveline MARTIN (Nancy)
    The text corpora of INaLF. Elements for a catalogue
    1996, Vol. I-2, pp. 84-86
  • Shana POPLACK (Ottawa, Canada)
    The corpus of spoken French of Ottawa-Hull
    1996, Vol. I-2, pp. 95-97
  • Louise PÉRONNET (Moncton, Canada)
    Linguistic research on French spoken in Acadia
    1996, Vol. I-2, pp. 98-99
  • Laurent ROMARY (Nancy)
    The SILFIDE project. Towards an open access to French linguistic sources
    1996, Vol. I-2, pp. 77-83
  • Patrick SAINT-DIZIER (CNRS-Toulouse)
    Challenges and methods in building lexical semantic tools
    2002, Vol. VII-1, pp. 39-51

    This paper deals with the construction of lexical semantic resources for predicates, verbs and prepositions. We first raise questions about the theoretical perspectives and the methods to be applied. Next, we describe our resources: alternations, thematic grids and lexical conceptual structure representations. We conclude with some indications on the use of these resources in applications.

  • Emmanuel SCHANG (Orléans)
    CreolData: a lexical database on creole languages
    2005, Vol. X-1, pp. 65-76

    This paper presents CreolData, a multilingual lexical database of the Portuguese-based Creole languages of Africa. In section 2, we describe the goals of the project. Section 3 is devoted to a short description of the languages in the database. We then give an overview of XML and the standards for electronic dictionaries, and focus on the macrostructure and the microstructure (sections 4, 5 and 6). Finally, we outline future developments of this project (section 7).

  • José SOLER (UE)
    Lexical projects of the European Commission
    1997, Vol. II-1, pp. 79-81
  • Marianne STARREN (Max-Planck Institut, Allemagne)
    The European Science Foundation's Second Language Database
    1996, Vol. I-2, pp. 111-115
  • Céline VAGUER (Paris X-Nanterre)
    Creating a database: the different usages of 'dans' in marking simultaneity
    2004, Vol. IX-1, pp. 83-97

    The setting-up of a database from which a corpus and associated information (syntactic, semantic, etc.) are derived is not a natural undertaking in non-computational linguistics. This article sets out to present how such a technique can be exploited within the context of a research project focussing on the French preposition dans.

  • André VALLI (Aix-en-Provence)
    Grammatical labeling of corpora of spoken language: problems and perspectives
    1999, Vol. IV-2, pp. 113-133

    The use of transcription conventions that attempt to code the specific properties of speech, such as false starts, hesitations, and repetitions, and do not rely on the usual written punctuation, suggests that the grammatical tagging of transcribed oral corpora might be a very difficult undertaking. Developing speech-specific taggers, although desirable, would be a long-term project. In the experiment reported in this article, a spoken corpus was tagged using a system designed for written text, along with some appropriate pre-editing and post-editing programs. Quite unexpectedly, the results for speech were excellent, almost as good as those previously obtained for writing. This discovery allows us to foresee the rapid compilation of large tagged spoken corpora for French.

  • Nathalie VALLÉE (INPG-Grenoble)
    The UPSID database: its aims and its use
    1999, Vol. IV-1, pp. 7-19

    The search for universal tendencies in the languages of the world is undoubtedly a necessary axis for any theoretical perspective in linguistics. We present here UPSID (UCLA Phonological Segment Inventory Database, Maddieson, 1986; Maddieson & Precoda, 1990). This database contains phonological data which are genetically balanced and whose description is harmonized. We implemented it at ICP to enrich typological research on vowels, diphthongs and consonants. We have analysed UPSID with the help of an original methodology which not only confirms or refines some regularities already noted, but also brings new data to light.

  • Miriam VOGHERA (Naples, Italie)
    Corpora of Italian
    1996, Vol. I-2, pp. 131-134
  • Piek VOSSEN (Amsterdam, Pays-Bas)
    WordNet, EuroWordNet and Global WordNet
    2002, Vol. VII-1, pp. 27-38

    In this article we aim to present the architecture of the database WordNet, organised in order to represent conceptual relations, and set up initially for the English language, as well as its extensions made under the name of EuroWordNet for seven other European languages.