Tibetan-Chinese-Sanskrit Text Alignment using Intelligent Agents and Genetic Algorithms

Handy, Christopher

Author

Handy, Christopher

Source Title

數位典藏與數位人文國際研討會（第9屆）=International Conference of Digital Archives and Digital Humanities (9th)

Publication Date

2018.12.18

Pages

43 - 44

Publisher

臺灣數位人文學會 (Taiwanese Association for Digital Humanities)

Place of Publication

Taipei, Taiwan

Resource Type

Proceeding Article

Language

English

Note

1. Handy, Christopher: Principal Software Engineer, ERC Open Philology, Leiden University.

Keywords

Tibetan; Chinese; Sanskrit; alignment; genetic algorithm; intelligent agent

Abstract

The problem of multilingual text alignment is a frequent concern in the study of Buddhist texts. Often we find ourselves in possession of several Chinese, Tibetan and Sanskrit versions of a given textual work, without a clear sense of exactly how each individual text relates to the others. Two texts may contain some material in common without sharing all of their content, or share the bulk of their content but with phrases in different orders, or have a common vocabulary but no shared content at all. These issues are well known to philologists, and the idea of using computer software to alleviate some of the mechanical legwork in comparing texts has revolutionized the ways that we do research, both within the narrow field of Buddhist studies and far more broadly. Yet ancient texts, and especially ancient Asian texts, pose difficulties that prevent some popular text analysis methods commonly used for modern European languages from working properly with Tibetan, Chinese and Sanskrit. One desirable and reasonably complex task is to compare any two texts across these three languages, measure quantitatively how similar they are, and align the texts based on regions of similarity. The method I describe here can theoretically achieve this goal for any set of input texts in any language, but my examples are restricted to a specific set of Buddhist works in Chinese, Tibetan and Sanskrit called the Mahāratnakūṭa Sūtra (MRK). I demonstrate here a proof of concept on a few texts from this collection, and then discuss areas for improvement of the basic idea.

My method involves applying a genetic algorithm to intelligent agents to evolve the best alignments naturally from a given set of texts. Intelligent agents are computer programs designed to carry out a specific set of tasks using some kind of deterministic method and knowledge base. This type of system is useful when we know how to describe a decision process, but do not know all possible results of a decision. Genetic algorithms are information transmission schemes modeled on biological processes. They differ from biological processes in that we tend to specify a quantifiable end goal for them to reach without specifying the means of getting to the goal. By stating this goal in terms of a fitness function, we can promote reproduction of agent genes in our model for those organisms that are fittest according to the desired output (i.e., consistently more accurate text alignments). Over multiple generations of this promotion, the gene pool of agents approaches 100% fitness (normally an unreachable ideal). Genetic algorithms are useful for applications in which we know what we want our output to look like but have no idea how to get the results. For our text alignment problem, we have target words in our text that will be “most interesting” in the mathematical sense. We do not care whether the computer finds these in the most efficient way, only that it reliably reports them. What counts as most interesting, however, could change as additional witnesses are added, so our system must adapt as it analyzes more texts.
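The loop described above (score candidates with a fitness function, let the fittest reproduce, repeat over generations) can be sketched in a few lines of Python. This is only an illustrative toy, not the method used in the paper: the genome encoding (one target-segment index per source segment), the truncation selection, the one-point crossover, and all names and parameters (`fitness`, `evolve`, `pop_size`, `mutation_rate`) are assumptions made for the example.

```python
import random

def fitness(genome, reference):
    """Fraction of positions where a candidate alignment agrees with a manual one."""
    return sum(g == r for g, r in zip(genome, reference)) / len(reference)

def evolve(reference, n_targets, pop_size=60, generations=200,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm; `reference` needs at least two positions."""
    rng = random.Random(seed)
    n = len(reference)
    # Hypothetical encoding: genome[i] is the target segment aligned to source segment i.
    population = [[rng.randrange(n_targets) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, reference), reverse=True)
        if fitness(population[0], reference) == 1.0:
            break  # the (normally unreachable) ideal was reached on this toy input
        parents = population[: pop_size // 2]  # truncation selection, parents survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: occasionally replace a gene with a random target index.
            child = [rng.randrange(n_targets) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    best = max(population, key=lambda g: fitness(g, reference))
    return best, fitness(best, reference)

# Toy manual alignment: source segment i aligns to target segment i.
reference = list(range(8))
best, score = evolve(reference, n_targets=8)
print(best, score)
```

Only the fitness function knows what a good alignment looks like; the loop itself never specifies how to reach one, which is the property the abstract emphasizes.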

Our agents in this scenario are tiny grammatical engines that each perform a sequence of short alignment tasks between strings of syllables encountered in the input texts, based on training they receive from manual alignments. By stacking sequences of successful organisms together, we can obtain various alignment suggestions from the model.
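One way to picture such an agent, purely as a hedged sketch: each agent carries a small set of syllable correspondences, proposes index pairs between two syllable sequences, and is scored against a manual alignment. The abstract does not specify how agents are encoded, so the `Agent` class, its `rules` field, the `f1` score and the `stack` helper below are all hypothetical, and the Tibetan and Chinese tokens are toy data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agent:
    # Hypothetical genome: a set of (source_syllable, target_syllable) pairs.
    rules: frozenset

    def propose(self, source, target):
        """Scan left to right and return (i, j) index pairs, never moving backwards in target."""
        pairs = []
        j_min = 0
        for i, syllable in enumerate(source):
            for j in range(j_min, len(target)):
                if (syllable, target[j]) in self.rules:
                    pairs.append((i, j))
                    j_min = j  # stay on j so several syllables may map to one target
                    break
        return pairs

def f1(proposed, manual):
    """F1 score of proposed index pairs against a manual alignment."""
    proposed, manual = set(proposed), set(manual)
    hits = len(proposed & manual)
    if hits == 0:
        return 0.0
    precision = hits / len(proposed)
    recall = hits / len(manual)
    return 2 * precision * recall / (precision + recall)

def stack(agents, source, target):
    """Combine the suggestions of several agents into one alignment."""
    combined = set()
    for agent in agents:
        combined.update(agent.propose(source, target))
    return sorted(combined)

# Toy data: Tibetan syllables (Wylie) against Chinese characters.
source = ["sangs", "rgyas", "chos", "dge", "'dun"]
target = ["佛", "法", "僧"]
manual = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]

agent_a = Agent(frozenset({("sangs", "佛"), ("rgyas", "佛")}))
agent_b = Agent(frozenset({("chos", "法"), ("dge", "僧"), ("'dun", "僧")}))

print(f1(agent_a.propose(source, target), manual))
print(f1(stack([agent_a, agent_b], source, target), manual))
```

Each agent alone covers only part of the manual alignment; stacking the two recovers all of it on this toy input. A score of this kind could serve as the fitness that decides which agents reproduce.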

1. Introduction to the ERC Open Philology Project
1.1. The MRK Collection as a test project
1.2. The Buddhist Canon as a Digital Object: Resolution and Scope
1.3. The Problems of Current Software
2. Examples
2.1. Manual tests
2.2. Computer random tests
2.3. Assembling an organism
2.4. Massive population parallel problem solving
3. Conclusions
3.1. Interpretations of Data
3.2. Comparison of Automated and Human Alignments
3.3. Further research
4. Data

Record Created

2019.01.28

Last Updated

2019.02.26
