Corpus linguistics

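Corpus linguistics is the study of language based on real-world instances of language use, collected in corpora. It can be used to analyze the grammar and syntax of a natural language and to study how that language relates to others. Corpora were originally compiled by hand; today they are mostly compiled automatically by computer.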

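Corpus linguists hold that reliable linguistic analysis must rest on fresh corpus data, collected in natural linguistic contexts and with minimal experimental interference. Within the field, views on the value of corpus annotation differ widely. At one end, John McHardy Sinclair^[1] advocated minimal annotation so that texts can "speak for themselves"; at the other, the Survey of English Usage team (based at University College London)^[2] encourages more annotation and regards it as a path to a fuller and more rigorous understanding of language.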

History

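A landmark in modern corpus linguistics was the 1967 publication of Computational Analysis of Present-Day American English by Henry Kucera and W. Nelson Francis. The work was based on an analysis of the Brown Corpus, a carefully compiled corpus of American English of about one million words. Kucera and Francis subjected the corpus to a variety of computational analyses and obtained rich and varied results that combined elements of linguistics, language teaching, psychology, statistics, and sociology. Another key publication was Randolph Quirk's 1960 paper "Towards a description of English Usage"^[3], in which he introduced the Survey of English Usage project.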

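Shortly afterwards, the Boston publisher Houghton Mifflin approached Kucera to supply a million-word, three-line citation base for its new American Heritage Dictionary. The American Heritage Dictionary was innovative in combining prescriptive elements (how language should be used) with descriptive elements (how language is actually used).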

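Other publishers followed suit. The British publisher Collins' COBUILD monolingual learner's dictionary, published for non-native speakers learning English, drew on the Bank of English corpus. The Survey of English Usage corpus was used in A Comprehensive Grammar of the English Language by Quirk et al.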

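The Brown Corpus also spawned a number of similar corpora: the LOB Corpus (Lancaster-Oslo-Bergen Corpus; 1960s British English), Kolhapur (Indian English), Wellington (New Zealand English), the Australian Corpus of English (Australian English), the Frown Corpus (early 1990s American English), and the FLOB Corpus (1990s British English). Other corpora include the International Corpus of English and the British National Corpus, a 100-million-word collection of spoken and written texts created in the 1990s by publishers, the University of Oxford, Lancaster University, and the British Library. For contemporary American English, there is the American National Corpus, as well as the Corpus of Contemporary American English, which contains more than 400 million words, covers texts from 1990 onward, and can be searched online.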

The first computerized corpus of transcribed spoken language was built in 1971 by the Montreal French Project.^[4] It contained one million words, and it inspired Shana Poplack's much larger corpus of spoken French in the Ottawa-Hull area.^[5]

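Besides living languages, corpora have also been compiled for ancient languages. One example is the Andersen-Forbes database of the Hebrew Bible, developed since the 1970s, in which every clause is parsed using graphs that represent up to seven levels of syntax and every segment is tagged with seven fields of information.^[6] The Quranic Arabic Corpus is an annotated corpus of the Classical Arabic text of the Quran. It carries multiple layers of annotation, including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar.^[7]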

Methods

Corpus linguistics has produced a number of research methods, all of which try to trace a path from data to theory. Wallis and Nelson^[8] first introduced what they called the 3A perspective: Annotation, Abstraction, and Analysis.

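Annotation consists of applying a scheme to the texts of a corpus. Annotations may include structural markup, part-of-speech tagging, parsing, and other representations.
Abstraction consists of translating (mapping) terms in the scheme to terms in a theoretically motivated model or dataset. Abstraction typically involves linguist-directed search, but it may also involve, for example, rule learning for parsers.
Analysis consists of statistically probing, manipulating, and generalizing from the dataset. Analysis may include statistical evaluation, optimization of rule bases, or knowledge discovery methods.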
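
The three stages can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy lexicon, the tag names, and the helper functions `annotate`, `abstract`, and `analyse` are hypothetical and are not taken from Wallis and Nelson's work. A real project would likely use a trained tagger or parser for the annotation step.

```python
from collections import Counter

# Hypothetical toy lexicon mapping word forms to part-of-speech tags.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "old": "ADJ", "new": "ADJ",
    "corpus": "NOUN", "dictionary": "NOUN", "grammar": "NOUN",
    "describes": "VERB", "cites": "VERB",
}

def annotate(sentence):
    """Annotation: apply a scheme (here, POS tags) to raw text."""
    return [(w, TOY_LEXICON.get(w, "X")) for w in sentence.lower().split()]

def abstract(tagged_sentences):
    """Abstraction: map annotated text to a dataset for a research question.

    The question assumed here is which adjectives precede which nouns,
    so each ADJ + NOUN bigram becomes one row of the dataset.
    """
    rows = []
    for tagged in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                rows.append((w1, w2))
    return rows

def analyse(rows):
    """Analysis: generalise from the dataset (here, simple frequency counts)."""
    return Counter(rows)

texts = [
    "The old dictionary cites the new corpus",
    "A new corpus describes the old grammar",
    "The new corpus cites a new dictionary",
]
tagged = [annotate(t) for t in texts]
dataset = abstract(tagged)
counts = analyse(dataset)
print(counts.most_common(1))  # [(('new', 'corpus'), 3)]
```

Keeping the stages as separate functions mirrors the point of the 3A perspective: the same annotated texts can feed different abstractions, and the same dataset can be analysed in different ways.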

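Most lexical corpora today are part-of-speech-tagged (POS-tagged). However, even corpus linguists who work with unannotated text inevitably apply some method to isolate the words they are interested in from their surrounding sentences. In such cases, annotation and abstraction are combined in a lexical search.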
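
One common form of lexical search over unannotated text is a keyword-in-context (KWIC) concordance. The following is a minimal sketch, not a reference implementation: the function name `kwic`, its `width` parameter, and the sample text are assumptions chosen for illustration, and tokenization is reduced to a simple regular expression.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit.

    `width` is the number of tokens kept on each side of the keyword.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

sample = ("The corpus was compiled by hand. Today a corpus is usually "
          "compiled by computer, and each corpus can be searched online.")

# Print the hits aligned on the keyword, as in a printed concordance.
for left, key, right in kwic(sample, "corpus"):
    print(f"{left:>20} [{key}] {right}")
```

Here the search term does the work of annotation (it picks out the tokens of interest) and of abstraction (it turns them into rows of a dataset) in a single step.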

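The advantage of publishing an annotated corpus is that other users can then carry out their own research and experiments on it, so linguists and others with different interests can build on the work. By sharing data, corpus linguists can treat the corpus as a focus of linguistic debate rather than as an exhaustive fount of knowledge.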

Notes

[1] Sinclair, J. 'The automatic analysis of corpora', in Svartvik, J. (ed.) Directions in Corpus Linguistics (Proceedings of Nobel Symposium 82). Berlin: Mouton de Gruyter. 1992.
[2] Meurman-Solin, Anneli; Nurmi, Arja. Annotation, Retrieval and Experimentation. Annotating Variation and Change. Helsinki: Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki. 2007. OCLC 780136367.
[3] Quirk, R. 'Towards a description of English Usage', Transactions of the Philological Society. 1960. 40–61.
[4] Darnell, Regna. Canadian languages in their social context. Carbondale: Linguistic Research. 1979. ISBN 978-0-88783-003-7. OCLC 257958435.
[5] Poplack, S. 'The care and handling of a mega-corpus', in Fasold, R. & Schiffrin, D. (eds.) Language Change and Variation. Amsterdam: Benjamins. 1989. 411–451.
[6] Andersen, Francis I; Conrad, Edgar W; Newing, Edward G. Perspectives on language and text: essays and poems in honor of Francis I. Andersen's sixtieth birthday, July 28, 1985. Winona Lake, Ind.: Eisenbrauns. 1987. ISBN 978-0-931464-26-3. OCLC 14588192.
[7] Dukes, Kais; Atwell, Eric; Habash, Nizar. Supervised collaboration for syntactic annotation of Quranic Arabic. Language Resources and Evaluation. 2013-03, 47 (1): 33–62. ISSN 1574-020X. doi:10.1007/s10579-011-9167-7.
[8] Wallis, S. and Nelson, G. 'Knowledge discovery in grammatically analysed corpora'. Data Mining and Knowledge Discovery, 5: 307–340. 2001.

References

Journals

International peer-reviewed journals dedicated to corpus linguistics:

Corpora
Corpus Linguistics and Linguistic Theory
ICAME Journal
International Journal of Corpus Linguistics

Books

Book series in corpus linguistics:

Language and Computers
Studies in Corpus Linguistics
English Corpus Linguistics

Other books

Biber, D., Conrad, S., Reppen, R. Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge UP, 1998. ISBN 0-521-49957-7
McCarthy, D., and Sampson, G. Corpus Linguistics: Readings in a Widening Discipline. Continuum, 2005. ISBN 0-8264-8803-X
Facchinetti, R. Theoretical Description and Practical Applications of Linguistic Corpora. Verona: QuiEdit, 2007. ISBN 978-88-89480-37-3
Facchinetti, R. (ed.) Corpus Linguistics 25 Years on. New York/Amsterdam: Rodopi, 2007. ISBN 978-90-420-2195-2
Facchinetti, R. and Rissanen, M. (eds.) Corpus-based Studies of Diachronic English. Bern: Peter Lang, 2006. ISBN 3-03910-851-4

External links

Bookmarks for Corpus-based Linguists: very comprehensive site with categorized and annotated links to language corpora, software, references, etc.
Corpora discussion list
Freely-available, web-based corpora (100 million – 400 million words each): American (COCA, COHA), British (BNC), TIME, Spanish, Portuguese
Manuel Barbera's overview site
Przemek Kaszubski's list of references
AskOxford.com: the composition and use of the Oxford Corpus
DMCBC.com: Datum Multilanguage Corpora Based on Chinese, free sample download
Corpus4u Community: a Chinese online forum for corpus linguistics
McEnery and Wilson's Corpus Linguistics Page
Corpus Linguistics with R mailing list
Research and Development Unit for English Studies
Survey of English Usage
The Centre for Corpus Linguistics at Birmingham University
Gateway to Corpus Linguistics on the Internet: an annotated guide to corpus resources on the web
Biomedical corpora
Linguistic Data Consortium, a major distributor of corpora
Penn Parsed Corpora of Historical English
Corsis: (formerly Tenka Text) an open-source (GPLed) corpus analysis tool written in C#
ICECUP and Fuzzy Tree Fragments
Discussion group text mining

Corpus of Political Speeches: a searchable collection of speeches from the United States, Hong Kong, Taiwan, and China, provided by the Hong Kong Baptist University Library
