佛教語料「品目自動對列」研究--以《法華經》藏漢譯本為例=A Study of Automatic Chapter-level Alignment for Buddhist Texts: A Case Study for Tibetan and Chinese Version of the Lotus Sutra
One important research issue in Tibetan and Chinese Buddhist studies is the comparison and demarcation of Tibetan and Chinese Buddhist texts from the perspectives of linguistics and philology. However, Buddhist studies has yet to establish the precision and reliability of its texts, and few Sanskrit scriptures survive today. Studying the Tibetan and Chinese Buddhist texts side by side is therefore one practical way to identify gaps and to demarcate the Tibetan and Chinese translations. Buddhist texts have been transmitted for thousands of years, and many problems have arisen from differing translations and versions. These problems are very difficult to solve by examining a single translation in isolation; different translations and versions have to be examined iteratively as cross-references. Traditionally, this work has been done by Buddhist scholars by hand. It demands a great deal of labor and time and is practical only for small-scale research. This study therefore proposes an approach to automatic chapter-level alignment of Tibetan and Chinese texts. The purpose is to reduce the time and labor spent on processing texts, so that Buddhist scholars can concentrate on the demarcation and annotation work that computer systems cannot do. We applied Tibetan-Chinese dictionaries and built vector-space models based on theories and techniques from information retrieval (IR) and computational linguistics (CL). The Tibetan and Chinese test texts of the Saddharma-puṇḍarīka sutra were collected from the Taipei edition of the Tibetan Tripitaka in the Saddharmapuṇḍarīka Database and from the Taishō Shinshū Daizōkyō edition of the Chinese Tripitaka in CBETA, respectively. In addition, the effects of stop words and of the choice of bilingual dictionary on the proposed approach were investigated.
Two types of bilingual dictionaries were used in this study: the Tibetan-Chinese Great Dictionary by Zhang Yi-Sun (a general dictionary) and the Mahāvyutpatti by Ryozaburo Sakaki (a specialized Buddhist dictionary). Models with and without a stop-word list were also implemented and compared. The experimental results showed that the proposed model, using the CKIP segmentation tool and the specialized Buddhist dictionary, performed satisfactorily, placing the correctly aligned chapter within the top 2 candidates. By comparison, simple n-gram matching with the specialized Buddhist dictionary placed the correctly aligned chapter within the top 3 candidates. We conclude that an appropriate specialized Buddhist dictionary plays a key role in chapter-level alignment of Buddhist texts. In addition, the stop-word list was effective only for simple n-gram matching. In sum, automatic chapter-level alignment of Tibetan and Chinese Buddhist texts is feasible.
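
As an illustration only, the simple n-gram variant of dictionary-based chapter alignment might be sketched as follows. This is not the implementation used in the study: the abstract does not state the n-gram size, the term weighting, or the similarity measure, so character bigrams, raw counts, and cosine similarity are assumptions here, and CKIP segmentation is not reproduced. All function names, the three-entry dictionary, and the chapter texts are hypothetical toy data, not drawn from the dictionaries or corpora named above.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def translate_terms(tibetan_terms, dictionary):
    """Look up each Tibetan term; terms missing from the dictionary are dropped."""
    glosses = []
    for term in tibetan_terms:
        glosses.extend(dictionary.get(term, []))
    return glosses


def to_vector(strings, stop_ngrams=frozenset(), n=2):
    """Count character n-grams over several strings, skipping stop n-grams."""
    counts = Counter()
    for s in strings:
        for gram in char_ngrams(s, n):
            if gram not in stop_ngrams:
                counts[gram] += 1
    return counts


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_chapters(tibetan_terms, chinese_chapters, dictionary,
                  stop_ngrams=frozenset()):
    """Rank Chinese chapter titles by similarity to one Tibetan chapter."""
    query = to_vector(translate_terms(tibetan_terms, dictionary), stop_ngrams)
    scored = [(cosine(query, to_vector([text], stop_ngrams)), title)
              for title, text in chinese_chapters.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


def hit_within_top_k(ranking, gold_title, k):
    """True if the correct chapter appears among the first k candidates."""
    return gold_title in ranking[:k]


# Hypothetical toy dictionary (Wylie transliteration -> Chinese glosses).
toy_dictionary = {
    "dpe": ["譬喻"],
    "thabs": ["方便"],
    "byang chub sems dpa'": ["菩薩"],
}

# Invented one-sentence stand-ins for chapter texts, not quotations.
toy_chapters = {
    "方便品": "諸佛以方便力說法，菩薩聞已歡喜",
    "譬喻品": "以譬喻說火宅，長者以譬喻引導諸子",
}

# Terms of one (pre-segmented) Tibetan chapter, also invented.
toy_tibetan_chapter = ["dpe", "dpe", "byang chub sems dpa'"]

ranking = rank_chapters(toy_tibetan_chapter, toy_chapters, toy_dictionary)
print(ranking)
print(hit_within_top_k(ranking, "譬喻品", k=2))
```

In this sketch the bilingual dictionary is the only bridge between the two languages, which is consistent with the finding above that dictionary choice matters most; a stop-word list would enter through the `stop_ngrams` parameter.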