佛經中英語料非監督式自動句對齊之研究=Unsupervised Sentence Alignment of Corpora of Chinese-English Buddhist Texts

呂玟儀

Author

呂玟儀

Date

2019

Pages

1 - 164

Publisher

法鼓文理學院 [Dharma Drum Institute of Liberal Arts]

Publisher Url

https://www.dila.edu.tw/

Location

新北市, 臺灣 [New Taipei City, Taiwan]

Content type

博碩士論文=Thesis and Dissertation

Language

中文=Chinese

Degree

master

Institution

法鼓文理學院 [Dharma Drum Institute of Liberal Arts]

Department

佛教學系 [Department of Buddhist Studies]

Advisor

洪振洲、王昱鈞

Publication year

107

Keyword

句對齊; 中英文佛經句對語料庫; 動態規劃演算法; Sentence Alignment; The Parallel Corpora of Chinese and English Buddhist Texts; Dynamic Programming

Abstract

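As early as the Buddha's time, the World-Honored One taught the Dharma in many different languages and dialects. Later, the teachings were recorded in writing as Buddhist scriptures and spread further outward along routes of travel and trade. The scriptures were therefore translated into many languages: early translations into Tocharian, Khotanese, and Gāndhārī, followed by the Chinese translations made after Buddhism reached China and the Tibetan translations made after it reached Tibet. In modern times, as Buddhism has spread to Western countries, the scriptures have also been translated into many Western languages. Language is an important tool that allows people to communicate smoothly, and today's online translation services make language learning and interpersonal communication faster and more convenient. Current deep learning techniques in artificial intelligence have greatly improved the accuracy of automatic translation systems. Such translation technology, however, first requires a large sentence-level parallel corpus between the languages being translated. For Buddhist scriptures, large-scale digital corpora aligned at the sentence level are still lacking.
This thesis therefore studies unsupervised automatic sentence alignment, with the aim of finding a suitable algorithm that efficiently performs automatic sentence alignment of Chinese and English Buddhist texts. The main experimental materials are passages and short sutras randomly selected from the Chinese and English texts of the Chang Ahan Jing (《長阿含經》), the first scripture in the Taishō Shinshū Daizōkyō (《大正新脩大藏經》), together with the Chinese and English texts of the Foshuo Amituo Jing (《佛說阿彌陀經》). We first split each of the Chinese and English texts into sentences and segment them into words. Each English sentence is converted into a group of English words, and each Chinese sentence is converted into a group of English translation terms using Chinese-English lexical data that integrates Buddhist, classical Chinese, and general-purpose English-Chinese dictionaries. The two groups are then compared to find the words they share, a similarity score between them is computed using concepts from information retrieval, and a dynamic programming algorithm is applied to infer the best sentence alignment between the Chinese and English texts.
The experimental results are evaluated under two standards, rigid and relaxed. The evaluation figures are: average rigid precision 0.5957; average rigid recall 0.6774; average rigid F1-measure 0.6335; average relaxed precision 0.7847; average relaxed recall 0.7133; average relaxed F1-measure 0.7454. To improve the algorithm's performance, the thesis analyzes in depth the incorrect alignments that degrade it and identifies conditions that clearly affected similarity judgments in the experiments, such as insufficient English definitions of Chinese words, large numbers of superfluous unmatched words, word segmentation errors, excessive differences between Chinese and English sentence splitting, and overly complex correspondence structures between Chinese and English sentence pairs. It proposes feasible improvements, in the hope of building a more accurate module for automatic sentence alignment of Chinese-English Buddhist corpora, and thereby automatically producing a large and accurate Chinese-English sentence-pair corpus of Buddhist texts.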

The Buddha taught the dharma in a variety of dialects and languages. Afterward, his teachings were preserved orally for a long time before eventually being written down. With the spread of Buddhism, the Buddhist texts were translated into many different languages. They have been translated into Chinese since the Han Dynasty and began to be translated into Tibetan during the Tang Dynasty. In modern times, as Buddhism spread to Western countries, the Buddhist texts were translated into many Western languages.
Language is an important tool for smooth communication between people. Today, online translation tools make language learning and communication faster and more convenient. Recent developments in deep learning have greatly improved the accuracy of automatic translation systems. To achieve acceptable translation performance, these methods require a large corpus of parallel sentences in both languages for training. However, although Buddhist texts exist in many different languages, a well-constructed, sentence-aligned parallel corpus is still lacking.
Therefore, this thesis studies unsupervised sentence alignment, aiming to find an appropriate algorithm that efficiently handles the sentence alignment of Chinese-English Buddhist texts. For evaluation, several sutras with both Chinese and English versions are selected from the "Taishō Shinshū Daizōkyō", namely some of the sutras in the "Chang Ahan Jing (Dīrgha Āgama)" and the "Foshuo Amituo Jing". The Chinese and English texts are split into sentences and then segmented into words. For Chinese words, English explanations are gathered from Chinese-English dictionaries to convert the Chinese words into English terms. Next, each sentence is represented as a vector of its words, so that the similarity between two sentences is measured as the similarity of the two vectors. Given this similarity measure, we adopt an alignment algorithm based on dynamic programming to generate the optimal sentence alignment.
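The pipeline described above (term vectors, a similarity score, then dynamic programming) can be illustrated with a minimal sketch. This is not the thesis's implementation: the record does not give the exact similarity formula, the permitted alignment types, or any penalties, so cosine similarity over bag-of-words counts, the `BEADS` list, and `SKIP_PENALTY` are all assumptions chosen for illustration. The input is assumed to be already tokenized, with each Chinese sentence already converted to English dictionary terms.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bags of words (lists of tokens)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Assumed alignment types: (Chinese sentences consumed, English sentences consumed).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
SKIP_PENALTY = 0.1  # hypothetical cost for leaving a sentence unaligned

def align(zh, en):
    """zh, en: lists of token lists. Returns a list of (zh_indices, en_indices)."""
    n, m = len(zh), len(en)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]  # best[i][j]: top score for zh[:i], en[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # predecessor cell for traceback
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                if di and dj:
                    # Merge the sentences in the bead and score their shared terms.
                    left = [t for s in zh[i:ni] for t in s]
                    right = [t for s in en[j:nj] for t in s]
                    gain = cosine(left, right)
                else:
                    gain = -SKIP_PENALTY
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Walk back from the final cell to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return beads[::-1]

# Toy input: Chinese sentences are assumed to be already mapped to English gloss terms.
zh = [["buddha", "dwell", "sravasti"], ["monk", "assemble", "hall"],
      ["world-honored", "one", "speak"]]
en = [["buddha", "stay", "sravasti"], ["monk", "gather", "hall"],
      ["world-honored", "one", "speak", "thus"]]
alignment = align(zh, en)
```

On this toy input every sentence shares most of its terms with exactly one counterpart, so the sketch pairs the sentences one to one. Real texts would need the many-to-one alignment types more often, which is one of the difficulties the abstract reports.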
The results of the experiment are evaluated in terms of precision and recall under two standards: rigid and relaxed. The averages of the rigid precision, rigid recall, rigid F1-measure, relaxed precision, relaxed recall, and relaxed F1-measure are 0.5957, 0.6774, 0.6335, 0.7847, 0.7133, and 0.7454, respectively. The results show the effectiveness of our proposed method. After examining and analyzing the error cases in depth, we identify several causes of incorrect alignments, such as insufficient English definitions of Chinese terms, a large number of redundant terms, incorrect word segmentation, excessive differences in sentence splitting between Chinese and English, and overly complex Chinese-English sentence alignment structures. The goal of this thesis is to design a practical sentence alignment approach between Chinese and English Buddhist texts to build parallel corpo
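As a worked check on the figures above, the F1-measure is the harmonic mean of precision and recall. The short sketch below recomputes it from the averaged precision and recall; the function name `f1` is only illustrative. The values it yields (about 0.634 rigid and 0.747 relaxed) are close to, but not identical with, the reported 0.6335 and 0.7454. That small gap would be expected if the reported F1 figures are averages of per-text F1 scores, though the record does not state how the averages were taken.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Averaged precision and recall as reported in the abstract.
rigid = f1(0.5957, 0.6774)
relaxed = f1(0.7847, 0.7133)
print(f"rigid F1 from averaged P/R:   {rigid:.4f}")
print(f"relaxed F1 from averaged P/R: {relaxed:.4f}")
```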

Table of contents

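I. Research Motivation and Objectives 1
II. Review of Related Literature 3
(1) Current State of English Translations of Buddhist Texts 3
(2) Overview of Alignment Processing for Digital Parallel Buddhist Texts 4
(3) Research on Automatic Chinese-English Sentence Alignment 6
III. Problem Definition and Text Observations 8
(1) Problem Definition 8
(2) Why Sentence-Aligned Corpora Are Needed 8
(3) What Chinese-English Sentence Alignment of Buddhist Texts Looks Like 9
IV. Research Method 11
(1) Overview of Sentence Alignment Methods 11
(2) Research Scope and Data Sources 14
(3) Chinese and English Sentence Splitting, Word Segmentation, and Conversion to English Translation Terms 18
(4) Chinese-English Sentence Alignment 23
1. Computing the Similarity of Chinese and English Sentences 23
2. Dynamic Programming Algorithm 25
V. Experimental Evaluation 27
(1) Evaluation Setup 27
1. Experimental Data 27
2. Evaluation Method 29
3. Presentation of Evaluation Results 32
(2) Sentence Alignment Results for the Experimental Texts 33
(3) Summary of Experimental Results 36
VI. Analysis and Discussion 38
(1) Analysis of Mismatches Between Program-Suggested and Correct Sentence Pairs 38
1. DA_Su01 38
2. DA_Su12 50
3. DA_Su19 66
4. DA_Su24 70
5. AMTB_BDK and AMTB_VTD 76
(2) Summary of Factors Affecting the Accuracy of Automatic Sentence Alignment 83
1. Insufficient English Definitions of Chinese Terms 83
2. Large Numbers of Redundant Terms 84
3. Chinese Word Segmentation Errors 84
4. Excessive Differences Between Chinese and English Sentence Splitting 85
5. Overly Complex Chinese-English Sentence-Pair Correspondence Structures 85
6. Other Factors 86
VII. Conclusion and Outlook 87
VIII. References 89
Appendix 1: Table of Manually Aligned Correct Sentence Pairs 92
Appendix 2: Chinese-English Translation Term Groups for the Chinese Sentences in the Analysis Tables 155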

Created date

2021.08.12

Modified date

2023.01.07


Serial No.
621083
