佛經中英語料非監督式自動句對齊之研究=Unsupervised Sentence Alignment of Corpora of Chinese-English Buddhist Texts

呂玟儀 (author)

Author

呂玟儀 (author)

Source

全國佛學論文聯合發表會論文集（第30屆）=Proceedings of the 30th National Joint Conference for the Presentation of Buddhist Studies Papers

Date

2019.09

Pages

1 - 20

Publisher

法鼓文理學院=Dharma Drum Institute of Liberal Arts

Publisher Url

https://www.dila.edu.tw/traffic1

Location

新北市, 臺灣 [New Taipei City, Taiwan]

Content type

會議論文=Proceeding Article

Language

中文=Chinese

Note

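The author is affiliated with the Department of Buddhist Studies, Dharma Drum Institute of Liberal Arts (法鼓文理學院佛教學系).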

Keyword

句對齊=Sentence Alignment; 中英文佛經句對語料=The Parallel Corpora of Chinese and English Buddhist Texts; 動態規劃演算=Dynamic Programming

Abstract

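As transport and trade carried Buddhism ever farther abroad, Buddhist scriptures were translated into many languages, including Chinese, Tibetan, and English. Online translation now makes language learning and interpersonal communication faster and more convenient, and recent advances in deep learning have greatly improved the accuracy of automatic translation systems. Such translation technology first requires a large sentence-level parallel corpus for the language pair. For Buddhist scriptures, however, large digital corpora with sentence-level parallel correspondence are still lacking. This paper therefore studies unsupervised automatic sentence alignment, seeking a suitable algorithm that can efficiently align Chinese and English Buddhist texts at the sentence level. The main experimental material consists of two short sutras selected from the Chinese and English texts of the "Chang Ahan Jing (Dīrgha Āgama)", the first scripture in the "Taishō Shinshū Daizōkyō", together with the Chinese and English texts of the "Foshuo Amituo Jing". The Chinese and English texts are first split into sentences and segmented into words. Each English sentence is converted into a set of English words, and each Chinese sentence is converted into a set of English translation words using Chinese-English lexical data that integrates Buddhist, Classical Chinese, and general-purpose Chinese-English dictionaries. The two word sets are then compared to find the words they share. Next, a similarity score between the two word sets is computed using concepts from information retrieval and combined with a dynamic programming algorithm to derive the best sentence alignment between the Chinese and English texts. In the experiments, average precision is 0.5442, average recall is 0.645, and average F1-measure is 0.5902. Finally, the paper analyzes in depth the incorrect matches that degrade the algorithm's performance, focusing on the conditions that were clearly observed in the experiments to affect similarity judgments. The aim is to build a more accurate module for automatic sentence alignment of Chinese-English Buddhist corpora, so that a large and accurate sentence-aligned corpus of Chinese and English Buddhist texts can eventually be produced automatically.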

With the spread of Buddhism, Buddhist texts were translated into many languages, such as Chinese, Tibetan, and English. Today, online translation tools make language learning and communication faster and more convenient. Building an automatic translation system with deep learning requires a corpus containing a large number of parallel sentences in both languages for training. However, although Buddhist texts exist in many languages, a well-constructed sentence-aligned parallel corpus is still lacking. This thesis therefore studies unsupervised sentence alignment and seeks an appropriate algorithm for efficiently aligning the sentences of Chinese-English Buddhist texts. For evaluation, several sutras available in both Chinese and English are selected: some of the sutras in the "Chang Ahan Jing (Dīrgha Āgama)" and the "Foshuo Amituo Jing" from the "Taishō Shinshū Daizōkyō". The Chinese and English texts are split into sentences and then segmented into words. For Chinese words, English explanations are gathered from Chinese-English dictionaries to transform the Chinese words into English terms. Next, each sentence is transformed into a vector of its words, so that the similarity between two sentences is measured as the similarity between their vectors. Given this similarity measure, we adopt an alignment algorithm based on dynamic programming to generate the optimal sentence alignment. The evaluation results show that the average precision, recall, and F1-measure are 0.5442, 0.645, and 0.5902, respectively. We examine and analyze the error cases in depth and identify several factors that cause incorrect alignments. We will continue improving our method to achieve higher precision and, further, to create a practical sentence alignment approach for Chinese and English Buddhist texts for building parallel corpora.
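
The pipeline described in the abstract (sentence vectors, vector similarity, dynamic-programming alignment) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not state the term weighting, the similarity formula, or the alignment types allowed, so the sketch assumes raw term counts, cosine similarity, and monotonic 1-1 alignments with skipped sentences. The names `cosine`, `align`, and `skip_penalty`, and the toy sentences, are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(zh_glosses, en_tokens, skip_penalty=0.1):
    """Monotonic alignment by dynamic programming (assumed 1-1 with skips).

    zh_glosses: one list of English gloss words per Chinese sentence
    en_tokens:  one list of tokens per English sentence
    Returns a list of (zh_index, en_index) pairs.
    """
    zh = [Counter(s) for s in zh_glosses]
    en = [Counter(s) for s in en_tokens]
    n, m = len(zh), len(en)
    # score[i][j]: best total score aligning the first i zh and j en sentences
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                sim = cosine(zh[i - 1], en[j - 1])
                candidates.append((score[i - 1][j - 1] + sim, "pair"))
            if i > 0:  # leave a Chinese sentence unaligned
                candidates.append((score[i - 1][j] - skip_penalty, "skip_zh"))
            if j > 0:  # leave an English sentence unaligned
                candidates.append((score[i][j - 1] - skip_penalty, "skip_en"))
            score[i][j], back[i][j] = max(candidates, key=lambda c: c[0])
    # Trace the best path back from the final cell.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_zh":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Toy data (invented for illustration, not taken from the sutras).
zh_glosses = [
    ["buddha", "said", "monks"],
    ["pure", "land", "west"],
    ["hear", "name", "joy"],
]
en_tokens = [
    ["the", "buddha", "said", "to", "the", "monks"],
    ["they", "hear", "his", "name", "with", "joy"],
]
print(align(zh_glosses, en_tokens))  # [(0, 0), (2, 1)]
```

In this toy case the second Chinese sentence shares no words with either English sentence, so the best path skips it. A real system along the lines of the abstract would presumably also need weighted terms and many-to-one alignments, which this sketch omits.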

Table of contents

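I. Research Motivation and Purpose 1
II. Review of Related Literature 2
(i) Overview of Parallel Text Alignment Processing for Digital Buddhist Texts 2
(ii) Research on Automatic Chinese-English Sentence Alignment 2
III. Research Method 4
(i) Research Scope and Data Sources 4
(ii) Chinese and English Sentence Splitting, Word Segmentation, and Conversion to English Translation Words 5
1. English Sentence Splitting and Word Segmentation 5
2. Chinese Sentence Splitting and Word Extraction 6
3. Converting Chinese Word Sets into Chinese-English Translation Word Sets 7
(iii) Chinese-English Sentence Alignment 7
1. Computing the Similarity of Chinese and English Sentences 7
(1) Generating Sentence Vectors 7
(2) Computing Vector Similarity 8
2. Dynamic Programming Algorithm 9
IV. Experimental Evaluation 11
(i) Experimental Setup 11
1. Experimental Data 11
2. Evaluation Method 12
(ii) Summary of Experimental Results 12
V. Analysis and Discussion 13
(i) Insufficient English Definitions for Chinese Words 13
(ii) Large Numbers of Superfluous Words 13
(iii) Chinese Word Segmentation Errors 15
VI. Conclusion and Outlook 16
VII. References 16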

Created date

2022.10.20

Modified date

2023.09.22

Serial No.
652309
