Learner corpus research is a young but vibrant branch of research which stands at a crossroads between corpus linguistics, second language acquisition and foreign language teaching. Its origins go back to the late 1980s, when academics and publishers started collecting data from foreign/second language learners with a view to advancing our understanding of the mechanisms of second language acquisition and/or developing pedagogical tools and methods that more accurately target the needs of language learners. At first limited to English as a Foreign Language, learner corpus research has begun to spread to a wide range of languages and, as a result, the community of learner corpus researchers is rapidly growing and diversifying. The First Learner Corpus Research Conference, organized by the Centre for English Corpus Linguistics of the Université catholique de Louvain in September 2011, aimed to take stock of the advances made in the field in its more than twenty years of existence. The resulting proceedings volume covers issues of learner corpus design, collection and annotation and contains reports on various aspects of (written and spoken) learner interlanguage – pronunciation, prosody, grammar, lexis, phraseology and discourse – as well as the design of learner-corpus-informed tools. The volume also explores some of the ways in which learner corpus research could develop in the near future.
Katherine ACKERLEY
A comparison of learner and native speaker writing in online self-presentations: Pedagogical applications
This paper investigates the language used by both learners and native speakers of English when introducing themselves to peers in an online community, and then goes on to discuss the pedagogical potential of the findings. A small corpus of self-presentations written by 220 first-year students majoring in English at an Italian university was compiled during the 2009-2010 academic year. The learner corpus was compared with a reference corpus consisting of self-presentations produced by native speaker students in higher education in English-speaking countries and posted on online forums. The paper first considers why it is important that language majors aim to write in a way that is appropriate to a given genre, rather than merely focusing on morpho-syntactic accuracy. It then focuses on aspects of divergence between learner and native speaker production, presenting some of the linguistic choices made by learners when presenting themselves to peers. It goes on to discuss how the creation of awareness-raising materials based on the analysis can enhance learning by directing students' attention towards the differences between their texts and those of native speaker students.
Theodora ALEXOPOULOU, Helen YANNAKOUDAKIS & Angeliki SALAMOURA
Classifying intermediate learner English: A data-driven approach to learner corpora
We demonstrate how data-driven approaches to learner corpora can support Second Language Acquisition research when integrated with visualisation tools. We employ a visual user interface supporting the investigation of a set of automatically determined features discriminating between pass and fail First Certificate in English (FCE) exam scripts. We illustrate how the interface can support the investigation of individual features. The analysis of the most discriminative features indicates that the development of grammatical categories allowing reference to complex events, referents and discourse relations is a crucial property of the upper-intermediate level.
Margit BRECKLE & Heike ZINSMEISTER
L1 transfer versus fixed chunks: A learner corpus-based study of L2 German
This study deals with the question of what strategies Chinese L2 learners of German follow when starting a declarative sentence in German. The investigation is based on the ALeSKo corpus, a linguistically annotated learner corpus of written German. In previous studies, we observed that the L2 texts show a significant overuse of sentences that start with an information-structural function in comparison to comparable L1 texts. In this paper, we pursue an alternative line of explanation that explores whether the observed difference is due to an overuse of chunks in the L2 texts. We perform a chunk classification and also automatically detect all material copied from the title and the task description – a particular type of chunk. Our findings indicate that although L2 learners use chunks to a substantial degree, an overuse with respect to the beginnings of the sentences could not be confirmed.
Julian BROOKE & Graeme HIRST
Native language detection with ‘cheap’ learner corpora
We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has issues when used directly for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that results in cross-validated performance that appears misleadingly high, for all the feature types which are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which have been previously avoided. We also investigate ways to do NLD that do not involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.
Marcus CALLIES & Ekaterina ZAYTSEVA
The Corpus of Academic Learner English (CALE) – A new resource for the study and assessment of advanced language proficiency
This paper introduces the Corpus of Academic Learner English (CALE), a Language for Specific Purposes learner corpus that is currently being compiled for the quantitative and qualitative study of advanced learners' written academic English. CALE is designed to comprise seven academic genres produced by learners of English as a foreign language in a university setting and thus contains discipline- and genre-specific texts. The corpus will serve as an empirical basis to produce detailed case studies that examine linguistic determinants of lexico-grammatical variation, i.e. semantic, structural, discourse-motivated and processing-related factors that influence constituent order and the choice of structural variants, but also those that are potentially more specific to the acquisition of L2 academic writing such as task setting, genre and writing proficiency. Another major goal is to develop a set of linguistic criteria for the assessment of advanced proficiency conceived of as "sophisticated language use in context".
Erik CASTELLO
Integrating learner corpus data into the assessment of spoken interaction in English in an Italian university context
This paper reports on ongoing research conducted at the University of Padua on the teaching and assessment of spoken interaction in English at level B2 of the Common European Framework of Reference for Languages (CEFR, Council of Europe 2001). The study is mainly based on a small learner corpus (about 18,000 words) composed of transcripts of interactions between second-year English as a Foreign Language (EFL) students recorded during assessment sessions. It presents the context of the interactions, the corpora used and the results of a series of investigations carried out into some pragmatic aspects of the interactions. The paper then explores how these findings can help us to flesh out the construct for ‘Discourse Management’ and, ultimately, to set more reliable scoring criteria.
Evelyne CAUVIN
Intonational phrasing as a potential indicator for establishing prosodic learner profiles
Prosodic profiles have been extensively used in forensics and language pathology. However, they have rarely been used in second language acquisition research so far. The aim of this paper is to show how prosody can be used to define learner profiles, and possibly learners' learning styles and their different cognitive abilities. It is our claim that different segmentation modes of utterances define different prosodic learner profiles and we aim to characterise these. We will show that prosodic profiles of French learners of English can be drawn on the basis of phrasing and that a cluster of prosodic properties corroborates this typology. Our analysis is first based on read speech and the subsequent classifications on recorded interviews of the same speakers. It reveals the limitations of the phonological assessment criteria advocated by the Common European Framework of Reference for Languages (CEFRL) (Council of Europe 2001) and makes a good case for reconsidering them.
Meilin CHEN
Phrasal verbs in a longitudinal learner corpus: Quantitative findings
This study analyses Chinese learners’ use of phrasal verbs from a longitudinal perspective. Through a comparison of the learners’ output of phrasal verbs with that of two groups of native English speakers (American university students and British secondary school leavers), Chinese learners were found to be capable of producing an adequate number of phrasal verbs. Yet, they did not demonstrate appropriate choice of phrasal verbs. The longitudinal data reveal that the learners’ acquisition of phrasal verbs during their three years of study was not always linear. A considerable decrease in the number of phrasal verbs used in the students’ writing in their second year was noticed. No considerable increase in the use of phrasal verbs was observed at the end of their third year. Another important finding of this study is that the American students tend to use far more phrasal verbs than their British and Chinese counterparts.
Pieter DE HAAN & Monique VAN DER HAAGEN
The search for sophisticated language in advanced EFL writing: A longitudinal study
Even very advanced EFL writing tends to be less sophisticated than native writing. One of the problems seems to be finding the right collocations and the correct register. The aim of this article is to pinpoint what characterizes the development in very advanced Dutch EFL students’ written language production, more specifically the use of appropriate intensifiers. Compared to their native English-speaking contemporaries, the Dutch students initially tend to use intensifiers that are found typically in spoken English, such as really and a bit, but these gradually disappear. At the same time, as students progress, the use of the intensifiers so, quite and rather becomes more native-like. A qualitative analysis of a selection of essays written by four individual students shows that some students get more out of academic input than others.
Deise P. DUTRA & Tony Berber SARDINHA
Referential expressions in English learner argumentative writing
The aim of this paper is to report the findings of our investigation into lexical bundle types in learner argumentative writing. Our data consisted of the International Corpus of Learner English (ICLE), the Louvain Corpus of Native English Essays (LOCNESS), and Br-ICLE, the Brazilian sub-corpus of ICLE. Our classification followed the functional taxonomy proposed by Biber et al. (2004) and expanded by Simpson-Vlach & Ellis (2010). First, 3-, 4- and 5-word bundles were extracted and then categorized, manually and automatically, into broad categories (referential expressions, stance expressions and discourse organizing functions) as well as 18 specific subcategories (e.g. intangible and tangible framing attributes and quantity specification). Second, the most frequent categories in each corpus were identified. Third, we focused on the most frequent one: referential expressions. Fourth, the chi-square test, cluster analysis and ANOVA were used to detect significant differences across corpora. The subcategories that contributed the most to statistically significant differences across corpora were: specification of intangible framing attributes, identification and focus, and contrast and comparison. The results also show that there is more internal lexical variation of nouns in the intangible framing attribute bundles produced by native than by non-native speakers. The conclusions are that referential expressions might need to receive more attention in pedagogical contexts so that their discourse functions become more salient to learners.
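The bundle-extraction step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the tokenizer, the function name `extract_bundles`, and the frequency and dispersion thresholds (`min_freq`, `min_texts`) are assumptions chosen only for illustration, and the functional categorization and statistical tests are not covered.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def extract_bundles(texts, sizes=(3, 4, 5), min_freq=2, min_texts=2):
    """Count word sequences of the given sizes across a list of texts.

    A sequence is kept only if it occurs at least `min_freq` times overall
    and in at least `min_texts` different texts. Both thresholds are
    illustrative assumptions, not values taken from the study.
    """
    freq = Counter()
    text_range = defaultdict(set)  # bundle -> ids of the texts it occurs in
    for text_id, text in enumerate(texts):
        tokens = tokenize(text)
        for n in sizes:
            for i in range(len(tokens) - n + 1):
                bundle = " ".join(tokens[i:i + n])
                freq[bundle] += 1
                text_range[bundle].add(text_id)
    return {
        bundle: count
        for bundle, count in freq.items()
        if count >= min_freq and len(text_range[bundle]) >= min_texts
    }


if __name__ == "__main__":
    sample = [
        "On the other hand, it is clear.",
        "On the other hand we disagree.",
        "It is clear that on the other side",
    ]
    for bundle, count in sorted(extract_bundles(sample).items()):
        print(count, bundle)
```

Requiring a bundle to occur in several texts, and not only to be frequent overall, guards against sequences that a single writer repeats many times.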
Anna ESPUNYA
Investigating lexical difficulties of learners in the error-annotated UPF learner translation corpus
The aim of this article is two-fold. First, it describes the learner translation corpus developed at the Universitat Pompeu Fabra School of Translation and Interpreting (UPF-LTC). A learner translation corpus is a corpus of translations written by students; the UPF-LTC has two search configurations: as a bilingual, sentence-aligned, English-Catalan translation corpus and as a monolingual Catalan translation corpus. It has been annotated both with linguistic information and with error tags according to a set taxonomy of translation errors. The second aim is to illustrate the applications of the corpus for research into the types of translation errors involving lexical use such as false friends and deficient or imprecise lexical choices. The results are relevant not only for the didactics of translation but also for translation-oriented bilingual lexicography.
Michael FLOR & Yoko FUTAGI
Producing an annotated corpus with automatic spelling correction
This paper describes ConSpel, a software system for automatic detection and correction of non-word misspellings. We also present an ongoing research project for constructing an ETS (Educational Testing Service) Spelling Corpus. The corpus consists of essays written by native and non-native speakers of English in response to the writing prompts of TOEFL® and GRE® tests. Essays are annotated for misspellings by trained annotators, using a semi-automated methodology. An evaluation of the ConSpel system was conducted, using the data from the completed phase of the annotation project. The ConSpel system achieves above 95% accuracy in error detection. The evaluation also indicates that an advanced correction algorithm, which takes into account the local context of misspellings, achieves correction accuracy of 77% and consistently outperforms a baseline context-blind approach.
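To make the notion of a context-blind baseline concrete, here is a minimal sketch of non-word detection followed by correction that looks only at the misspelled string itself. It does not reproduce ConSpel: the function names, the toy lexicon and the use of Python's `difflib` string similarity are assumptions made purely for illustration.

```python
import difflib


def detect_nonwords(tokens, lexicon):
    """Return alphabetic tokens that are not in the lexicon (candidate non-word misspellings)."""
    return [t for t in tokens if t.isalpha() and t.lower() not in lexicon]


def correct_context_blind(word, lexicon):
    """Suggest the most similar lexicon entry, ignoring the surrounding words.

    Similarity is difflib's ratio, used here as a stand-in for whatever
    measure a real system would use; returns None if nothing is close enough.
    """
    matches = difflib.get_close_matches(word.lower(), sorted(lexicon), n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical toy lexicon; a real system would use a large word list.
    lexicon = {"the", "student", "wrote", "an", "essay", "yesterday"}
    tokens = "The studnet wrote an esay".split()
    for word in detect_nonwords(tokens, lexicon):
        print(word, "->", correct_context_blind(word, lexicon))
```

Because the suggestion depends only on string similarity, such a baseline cannot choose between equally similar candidates, which is the gap a context-aware algorithm is meant to close.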
Costas GABRIELATOS
If-conditionals in ICLE and the BNC: A success story for teaching or learning?
This paper aims to contribute to the methodological toolbox of "pedagogy-driven corpus-based research" (Gabrielatos 2006), that is, research which is situated at the intersection of language description, pedagogical lexicogrammar, and pedagogical materials evaluation (e.g. Harwood 2005; Hunston & Francis 1998; Kennedy 1992; Owen 1993). The contribution of the present paper mainly lies in proposing a method of triangulating the corpus-based evaluation of lexicogrammatical information in English as a Foreign Language coursebooks, by way of examining a relevant corpus sample of learner written output.
Thomas GAILLAT
This and that in native and learner English: From typology of use to tagset characterisation
Learner corpus research is now faced with a multiplicity of tagsets. It is therefore difficult to carry out cross-corpus analysis, because the tags used for each part of speech (POS) vary from one tagset to another. In this paper, we approach this issue through a specific linguistic case, the forms this and that. We propose a typology of their uses in both native and non-native corpora. Various tagsets are analysed so as to measure the relevance of the linguistic information provided for this and that. Overall, a comparative analysis of this and that in tagsets is proposed and the benefits and flaws of manual fine-grained annotation versus automatic annotation are assessed. This study is a first step towards automated annotation of this and that in various corpora, as this process would pave the way to corpus interoperability at POS level.
Francesca GALLINA
The Lexicon of Spoken Italian by Foreigners: A study on the acquisition of vocabulary by L2 Italian learners between measures of lexical richness and lexical fields
The aim of this paper is to present a corpus-based study of the acquisition of vocabulary by learners of L2 Italian. The goal of the research is to study the lexical uses of non-native speakers and the processes of lexical acquisition underlying these uses, applying some measures of lexical richness and analysing the lexical fields of the corpus. The informants of the corpus were non-native speakers with different proficiency levels, learning Italian both in Italy and outside of it. The main results show how lexical competence develops above all quantitatively at the beginning and intermediate levels, as we...