Tibetan-Chinese-Sanskrit Text Alignment using Intelligent Agents and Genetic Algorithms

Handy, Christopher

Author

Handy, Christopher

Source

數位典藏與數位人文國際研討會（第9屆）=International Conference of Digital Archives and Digital Humanities (9th)

Date

2018.12.18

Pages

43 - 44

Publisher

臺灣數位人文學會

Location

臺北市, 臺灣 [Taipei shih, Taiwan]

Content type

會議論文=Proceeding Article

Language

英文=English

Note

1. Handy, Christopher: Principal Software Engineer, ERC Open Philology, Leiden University.

Keywords

Tibetan; Chinese; Sanskrit; alignment; genetic algorithm; intelligent agent

Abstract

The problem of multilingual text alignment is a frequent concern in the study of Buddhist texts. Often we find ourselves in possession of several Chinese, Tibetan and Sanskrit versions of a given textual work, without a clear sense of exactly how each individual text relates to the others. Two texts may contain some material in common without sharing all of their content, or share the bulk of their content but with phrases in different orders, or have a common vocabulary but no shared content at all. These issues are well known to philologists, and the idea of using computer software to alleviate some of the mechanical legwork in comparing texts has revolutionized the way we do research, both within the narrow field of Buddhist studies and far more broadly in the study of texts of any kind. Yet ancient texts, and especially ancient Asian texts, pose difficulties that prevent some popular text analysis methods commonly used for modern European languages from working properly with Tibetan, Chinese and Sanskrit. One desirable but reasonably complex task is to compare any two texts across these three languages, measure quantifiably how similar they are, and align the texts based on regions of similarity. The method I describe here can in theory achieve this goal for any set of input texts in any language, but my examples are restricted to a specific set of Buddhist works in Chinese, Tibetan and Sanskrit called the Mahāratnakūṭa Sūtra (MRK). I demonstrate here a proof of concept on a few texts from this collection, and then discuss areas for improvement of the basic idea.

My method involves applying a genetic algorithm to intelligent agents to evolve the best alignments naturally from a given set of texts. Intelligent agents are computer programs designed to carry out a specific set of tasks using some kind of deterministic method and knowledge base. This type of system is useful when we know how to describe a decision process, but do not know all possible results of a decision. Genetic algorithms are information transmission schemes modeled on biological processes. They differ from biological processes in that we tend to specify a quantifiable end goal for them to reach without specifying the means of getting to the goal. By stating this goal in terms of a fitness algorithm, we can promote the reproduction of agent genes for those organisms in our model that are fittest with respect to the desired output (i.e., consistently more accurate text alignments). Over multiple generations of this promotion, the gene pool of agents approaches 100% fitness (normally an unreachable ideal). Genetic algorithms are useful for applications in which we know what we want our output to look like but have no idea how to get the results. For our text alignment problem, we have target words in our text that will be “most interesting” in the mathematical sense. We do not care if the computer finds these in the most efficient way, only that it reliably reports them. What is most interesting, however, could change as additional input witnesses are introduced, so our system must adapt as it analyzes more texts.
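The selection loop described above can be illustrated with a generic genetic algorithm. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the genome encoding (one target-segment index per source segment), the names `fitness` and `evolve`, and the parameters `pop_size`, `generations` and `mutation_rate` are all hypothetical choices made for illustration. The fitness function states only the goal (agreement with a known manual alignment), and selection, crossover and mutation find the means.

```python
import random

def fitness(genome, target):
    """Fraction of positions where the genome agrees with the desired output."""
    return sum(g == t for g, t in zip(genome, target)) / len(target)

def evolve(target, alphabet, pop_size=60, generations=200,
           mutation_rate=0.05, seed=0):
    """Evolve genomes toward `target`; return the best genome and the
    best fitness recorded in each generation. Hypothetical sketch."""
    rng = random.Random(seed)
    # Start from completely random genomes.
    population = [[rng.choice(alphabet) for _ in target]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda g: fitness(g, target), reverse=True)
        history.append(fitness(population[0], target))
        if history[-1] == 1.0:
            break
        # The fitter half of the population is allowed to reproduce.
        parents = population[: pop_size // 2]
        children = [parents[0]]  # elitism: keep the current best unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(target))  # single-point crossover
            child = a[:cut] + b[cut:]
            # Each gene mutates independently with a small probability.
            child = [rng.choice(alphabet) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = children
    best = max(population, key=lambda g: fitness(g, target))
    return best, history

if __name__ == "__main__":
    # Toy "manual alignment": source segment i maps to target segment manual[i].
    manual = [0, 1, 2, 4, 3, 5, 6, 7]
    best, history = evolve(manual, alphabet=list(range(8)))
    print(best, history[-1])
```

Because the best genome of each generation is carried over unchanged, the best fitness never decreases from one generation to the next, which mirrors the idea of a gene pool that approaches, but need not reach, 100% fitness.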

Our agents in this scenario are tiny grammatical engines that each perform a sequence of short alignment tasks on strings of syllables encountered in the input texts, based on training they receive from manual alignments. By stacking sequences of successful organisms together, we can achieve various alignment suggestions from the model.
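One way to picture such agents is as small pattern-matching rules over syllable sequences that are scored against manual alignments and then combined. The following is a hypothetical sketch only: the `Agent` class, the `find`, `score` and `organism` helpers, and the toy romanized syllables are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass

def find(seq, pattern):
    """Start indices at which `pattern` occurs in the syllable list `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if tuple(seq[i:i + n]) == pattern]

@dataclass(frozen=True)
class Agent:
    source: tuple  # syllables to look for in the source text
    target: tuple  # syllables expected in the aligned target text

    def propose(self, src, tgt):
        """Suggest (source_index, target_index) links where both patterns occur."""
        return [(i, j) for i in find(src, self.source)
                for j in find(tgt, self.target)]

def score(agent, examples):
    """Fraction of manually aligned examples in which the agent proposes
    at least one link that the manual alignment also contains."""
    hits = sum(1 for src, tgt, gold in examples
               if set(agent.propose(src, tgt)) & gold)
    return hits / len(examples)

def organism(agents, src, tgt):
    """Stack several agents: the union of their proposed links, sorted."""
    links = set()
    for agent in agents:
        links.update(agent.propose(src, tgt))
    return sorted(links)

# Toy training example (illustrative, not data from the paper):
# Tibetan syllables, Sanskrit syllables, and a manual set of links.
src = "sangs rgyas kyi chos".split()
tgt = "bud dha dhar ma".split()
gold = {(0, 0), (3, 2)}
examples = [(src, tgt, gold)]

good = Agent(("sangs", "rgyas"), ("bud", "dha"))
also_good = Agent(("chos",), ("dhar", "ma"))
bad = Agent(("kyi",), ("ma",))

# Keep only agents that agree with the manual alignment, then stack them.
survivors = [a for a in (good, also_good, bad) if score(a, examples) > 0]
print(organism(survivors, src, tgt))
```

In this sketch `score` only rewards agreement with the manual links and does not penalize spurious proposals; a fuller fitness measure would likely need to do both.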

1. Introduction to the ERC Open Philology Project
1.1. The MRK Collection as a test project
1.2. The Buddhist Canon as a Digital Object: Resolution and Scope
1.3. The Problems of Current Software
2. Examples
2.1. Manual tests
2.2. Computer random tests
2.3. Assembling an organism
2.4. Massive population parallel problem solving
3. Conclusions
3.1. Interpretations of Data
3.2. Comparison of Automated and Human Alignments
3.3. Further research
4. Data

Created date

2019.01.28

Modified date

2019.02.26
