Corpus Linguistics

This website is maintained by Michael Barlow. I would appreciate hearing about new links and about links listed here that are no longer active; please send email. I would also like to solicit your help in building up the non-English corpus listings. Please let me know of any suitable corpora. I am slowly compiling a bibliographic reference section that focusses on Corpus Linguistics and the use of corpora in language teaching. Suggestions for the bibliography and links to actual papers are welcome.
See also the Parallel Corpora page.
Texts: Corpora, Newspapers and News Sites
Chinese; Czech; Danish; Dutch; English; English-Miscellaneous; Estonian; Ethiopic; French; Gaelic; German; Hebrew; Italian; Malay; Norwegian; Polish; Portuguese; Russian; Scandinavian; Spanish; Swedish; Turkish; Miscellaneous
Learner corpora; Corpus searches; Word lists and Stop lists; Software; Text analysis; Taggers
Online papers, theses, etc. related to CL; Courses in Corpus Linguistics; Bibliography; Useful Sites and Home Pages
Mandarin corpus Big 5 encoding
Czech National Corpus Experimental WWW access to the corpus.
Institute of the Czech National Corpus
News in Danish. Address has changed.
The Institute for Dutch Lexicology have several large corpora, which can be accessed for academic research purposes.
American National Corpus
Oxford Text Archive Web site. OTA FTP site. (Mirror FTP site for North America: OTA.) Good starting point. Includes British novels, Dickens, Trollope, etc. The Susanne Corpus is in this archive in the directory pub/ota/public/susanne. For background info, see Susanne.
OED Online
Project Gutenberg: (in English)
Some literary works such as "Moby Dick" and "Through the Looking Glass" are available electronically from Project Gutenberg.
Corpus of Spoken, Professional American-English The corpus is available commercially from Athelstan. There is a 50,000-word sample available online.
The Bookstack An experimental index to online books.
The Faerie Queene and other works by Edmund Spenser
British National Corpus. A large (100 million words) corpus of modern English (1990s). BNC World Edition is now available. See also BNC Indexer
International Corpus of English
COBUILD offers access to a large corpus for a fee. Also has a free demo.
Wellington Corpus of Spoken New Zealand English. CD-ROM. Written New Zealand English is also
Penn-Helsinki Corpus of Middle English
Lampeter Corpus Early Modern English
ICAME, Bergen. This is the ftp site. ICAME also produces an excellent CD-ROM containing the Brown, LOB, London-Lund, and Helsinki corpora, among others. Also the home of the Corpora news-list. There is also a web site.
The Bergen Corpus of London Teenage Language
Corpus of Written British Creole
The TRAINS Spoken Dialogue Corpus
CCAT Archive Gopher site at U. Penn. A good site for classical, historical, and religious texts.
U.S. Government publications.
Voice of America News (Gopher)
CBC Canadian broadcasting. Includes sound files.
Time and Time Daily
Marx & Engels Online Library
World Religious Texts
Canterbury Tales Project An electronic Chaucer from Cambridge University Press.
Ftp site for Red Dwarf scripts
O.J. Simpson Trial Transcripts. Another transcript source: O.J. trial transcripts. And another good source.
Estonian Law (in English)
Proper names Ftp site.
Presidential Inaugural Addresses Old link. All the presidents' addresses.
Russian novel Gopher
Progress: Family Systems research and Therapy Full text journal. (Phillips Graduate Institute?)
Estonian Corpus of Written Texts, and in Estonian
Estonian Law (in Estonian!)
Thesaurus Linguae Aethiopicae
Louisiana French MOVED ??
French novels
News in French (Gopher) MOVED ??
Dictionnaire de l'Académie française
Radio France Internationale
Old and Middle French
CURIA Project (medieval Irish texts)
Mannheimer Corpora A very large, growing, online German corpus archive (778 million words in August 2000). A copyright-free portion of the archive (379 million words in August 2000) is freely searchable. Invited guests have access to the whole archive. Partially tagged.
Project Gutenberg (German texts)
German newspapers -- tagged corpus with syntactic structure annotated.
German News: subscribe by sending an e-mail request to
Today's news in German
Spoken Israel Hebrew Description of the project.
Comparative Indo-European Includes 200-item lexicostatistical lists for 95 Indo-European speech varieties, cognation judgments between the lists, lexicostatistical percentages, etc.
CORIS (CORpus di Italiano Scritto), being developed at CILTA. The corpus will be available online and on CD-ROM at the end of 2000.
Italian literature (LiberLiber)
Italian newspaper
Malay Classical literature. Searchable online.
Oslo Corpus of Tagged Norwegian Texts
Polish Newspaper
Projecto Vercial Portuguese literature database
The files disk1.taz and disk2.taz are available in the directory ~ftp/pub/linguistics. The file cbmp.txt contains background information on the corpus. Papers and useful info.
News from Brazil
Russian literature
Russian foreign affairs articles I have not had much luck with this.
Vesti: A Canadian-Russian Newspaper.
Russian word list gopher.
Language Bank of Swedish Texts
Project Runeberg (Scandinavian classics)
Norwegian Law
South American oral and written texts available via ftp from
Spanish Syntax Research Group, University of Santiago de Compostela. Information about ARTHUS (1.5 million words in modern Spanish) and a syntactic database (BDS, 160,000 analysed clauses of ARTHUS). In progress: a medieval and classic Spanish corpus ("ARTHUS Medieval y Clasico").
"Maria" corpus Acquisition of Spanish.
Mexican Newspapers: El Nacional, La Jornada, etc.
Bank of Swedish
Turkish with an Australian flavour.
Telephone speech corpus: 22 Language Corpus
Telephone speech corpus: Alpha-numeric corpus
Learner corpora Extensive information from Yukio Tono
Hungarian EFL Student Writing
ICLE - Brazilian Portuguese Sub-Corpus
COSMAS search Institut für Deutsche Sprache, Mannheim, Germany.
IMS Stuttgart (Penn Treebank) search -- OLD LINK??
Cobuild Corpus Sampler
University of Michigan Middle English Collection
Michigan Early Modern English Materials
Blake, Wordsworth, etc. Web concordance
Web-based analysis of Gutenberg texts by Ron Reck. See also Corpus Access at the University of Essex.
VISL Project Denmark. English and German corpora can be searched.
Concordance of Great Books
British National Corpus Simple search
LDC Online
French Stop list from
Stop lists and frequency lists for English, French and German. From Patrice Bonhomme.
Zipped file of n-grams from the Brown Corpus
Mike Scott's page contains several English wordlists.
COSMAS - A corpus analysis toolbox, accessible online since 1995; see COSMAS. 778 million words online, virtual corpus composition, complex query language, concordancing, collocation analysis, etc.
MonoConc Pro. Commercial Windows concordance program (produced by me). See the Athelstan site.
MonoConc, a Mac/Windows concordance program that allows sorts (2R,1R,2L,1L) and provides simple frequency information. For information on availability, see MonoConc.
ParaConc, a Mac/Windows concordance program for parallel texts. A version is available for free for research purposes (under license). For other uses, the single user price is $49.95. See ParaConc.
Conc, a Mac concordance program, is available via ftp from SIL. Also available by anonymous-ftp from (/
Indiana University LETRS Conc QuickGuide.
Free Text, a Mac concordance program, should be available from the U. of Michigan site. Also available from
HUM, developed by William Tuthill, is available by anonymous-ftp from (/
Perl: Dan Melamed's perltools
Tact. Available via ftp from University of Toronto (
Indiana University LETRS TACT QuickGuide
World Wide Web implementation of TACT: TACTWeb. "TACTweb connects TACT to the World Wide Web - making a TACT TDB database accessible to the entire WWW community." See also Elisabeth Burr's site.
LEXA Corpus processing Software version 6 (for DOS) is available via ftp. This is a suite of programs for tagging, lemmatization, word frequency counts, etc.
TextAnalyst Commercial software that produces a semantic network on the basis of text input. The company, Megaputer, also produces a data mining tool, PolyAnalyst.
Lexical Freenet Web-based thesaurus
ShoeBox Fieldwork oriented program. Information available from SIL.
VisualText A suite of commercial text analysis tools.
Word Cruncher Info available from WPT
WordSmith Mike Scott's WordSmith page.
Paai's text utilities: A set of utilities consisting of Unix scripts and C programs for frequency counts and lexical cohesion.
Windows CLAN
Eric Brill's program. FTP site.
TOSCA/LOB tagger for DOS. Downloadable.
Rank Xerox in Grenoble have an interesting site. It is possible to enter text in French, English, German etc. and get it tagged.
AMALGAM Email tagging, conversion of tagsets, ...
AUTASYS by Alex Chengyu Fang at UCL.
SemanTag A variant of Brill's Tagger??
TreeTagger Language-independent HMM tagger. Parameter files for English, French, German.
CRATER report. Discussion of a modified version of the Xerox Tagger.
Tagger overview by Linda Van Guilder
The Corpus Linguistics Group at the University of Birmingham has an experimental email tagger, QTAG. Texts can be sent via email to
The (LOB) CLAWS1 tag set
CoreLex -- a tagset and database for semantic tagging based on WordNet
Michael Rundell: The future of the corpus, and the corpus of the future
Theses
Torbjörn Lager. Thesis: A Logical Approach to Computational Corpus Linguistics
The BNC Handbook: Exploring the British National Corpus with SARA. Guy Aston and Lou Burnard. Edinburgh Textbooks in Empirical Linguistics.
Corpus Linguistics: Investigating Language Structure and Use. Douglas Biber, Susan Conrad, Randi Reppen
An Introduction to Corpus Linguistics. Graeme Kennedy
Computer Corpus Lexicography Vincent B Y Ooi. Edinburgh Textbooks in Empirical Linguistics.
Corpus Linguistics Tony McEnery and Andrew Wilson. Edinburgh Textbooks in Empirical Linguistics.
Language and Computers: A Practical Introduction to the Computer Analysis of Language. Geoff Barnbrook. Edinburgh Textbooks in Empirical Linguistics.
Pattern Grammar A corpus-driven approach to the lexical grammar of English. Susan Hunston and Gill Francis Studies in Corpus Linguistics 4
Patterns and Meanings Using corpora for English language research and teaching. Alan Partington. Studies in Corpus Linguistics 2
Statistics for Corpus Linguistics. Michael Oakes. Edinburgh Textbooks in Empirical Linguistics.
Terms in Context Jennifer Pearson. Studies in Corpus Linguistics 1
Text and Technology In honour of John Sinclair. Mona Baker, Gill Francis and Elena Tognini-Bonelli (eds.) John Benjamins.
Tutorial: Concordances and Corpora Cathy Ball, Georgetown.
Methods and Tools for Large-Scale Corpus Linguistics
Eugene Charniak: Statistical course
Elisabeth Burr: Korpuslinguistik course
Tony Berber Sardinha: Corpus Linguistics courses: 1998-1999; 2000
Mark Davies: History of the Spanish Language; Assignments and projects
Chris Brew: Statistical NLP; Probabilistic modelling
Javier Perez-Guerra: English linguistics (written in Galician)
Bilge Say: Using Corpora for Language Research
Sabine Reich: Corpus course
References (by name)
References (by topic)
References compiled at UCREL (Computational Linguistics)
Centres and Departments
Corpus Linguistics at Birmingham University, England.
Center for Electronic Texts in the Humanities.
Centre for English Corpus Linguistics, Louvain
CTI Centre for Modern Languages Based in Hull, England. Newsletter, language software guide, info on language teaching.
Oxford Text Archive
Oxford University Language Centre
UCREL Site Lancaster University, England
Tuscan Word Centre
Other Useful Sites
African Languages Lexicon Project (ALLEX)
Alex gopher site
Alex allows users to find and retrieve the full-text of documents on the Internet.
American National Corpus
Annotation page at Upenn. Describes some 40 tools and formats for creating and managing linguistic annotations.
Athena Large e-text site
Books Online
CHILDES Parent-child interactions.
Alex Chengyu Fang Page Alex's page contains info on his various corpus tools.
Tim Johns Classroom Concordancing Page.
Collocations page
Concordancing page
Corpus Encoding Standards Coordinated by Nancy Ide
ECI/MCI Multilingual corpus information
Electronic Text Archive
English Language Corpora
the etext pages
European Language Resources Association ELRA catalogue
Human Languages Page at Willamette.
ICAME and Hong Liang Qiao's web page
Index of electronic text projects
Internet Corpora Index
Literature in various languages. University of Virginia ETC. See also Le letterature del mondo
MATE Project Annotation of spoken corpora
Pittsburgh U. Electronic Text project
SPIRE Text visualisation analysis
Survey of English Usage An interesting page.
Taglog project Logic-based Corpus Theory Development Environment.
WInter Web Internationalization Page. Multilingual WWW issues. MOVED ???
Send additions to Michael Barlow (