《中阿含經》與《增壹阿含經》之文本翻譯風格量化分析與相似斷詞自動化擷取=Quantitative Analysis of Translation Styles and Automatic Similar Phrases Identification of the Madhyama-āgama and the Ekottarika-āgama
《大正藏》經號T 26《中阿含經》與T 125《增壹阿含經》,兩經之譯者皆記載為僧伽提婆;對於現存《中阿含經》的譯者記錄,目前學界還沒有人提出異議,但是對於現存《增壹阿含經》的譯者記錄,各家說法不一,學界對此尚無定論。本研究嘗試利用統計量化分析的方式,對《中阿含經》與《增壹阿含經》進行翻譯風格分析,以此探討現存《中阿含經》與《增壹阿含經》是否來自相同譯者的作品。研究方法為:以「可變長度n-gram」(variable length n-gram,VL n-gram)為切詞方法,經由適當的篩選門檻找出風格特徵詞,再搭配主成分分析法(principal components analysis,PCA)進行統計分析,以之觀察兩經的翻譯風格是否具有一致性。分析結果顯示,兩經的翻譯風格有顯著的差異。本研究同時使用人工比對的方式從已經找出來的眾多風格特徵詞中尋找意義相似的斷詞,以此觀察兩個文本是否有用字不同卻是意義相似的詞彙或短語。經過人工判讀後,找到諸多例證顯示兩個文本翻譯風格之差異受到譯者用字習慣的影響。研究結果顯示,現存漢譯《中阿含經》和《增壹阿含經》,有極高的機率不是來自相同譯者的作品。在研究過程中,有鑑於以人工比對所需投入的大量工時,本研究也嘗試尋找一個自動化識別相似斷詞的方法,期能提高研究效率,並且因應日後巨量詞組的比對需求。我們以「最長共同子序列」(longest common subsequence,LCS)作為兩兩斷詞之間相似程度的衡量方法。實驗結果顯示,此衡量方法之成效雖非顯著,然而對於大量詞組的比對,仍不失為一個可用的方法;在演算結果中可能包含著關鍵性的線索,能夠提供學者作為進一步研究之用。
In the Taishō Tripiṭaka, the translations of the Madhyama-āgama (T 26) and the Ekottarika-āgama (T 125) are both attributed to the same person, Gautama Saṅghadeva. So far, no one has disputed the attribution of the Madhyama-āgama to Gautama Saṅghadeva, but scholars hold different opinions about who translated the Ekottarika-āgama, and the question remains unsettled. This study analyzes the translation styles of the Madhyama-āgama and the Ekottarika-āgama by quantitative methods and discusses whether the two collections are the work of the same translator. The research method is as follows: (1) the variable length n-gram (VL n-gram) is used to split the text of T 26 and T 125 into shorter segments, called grams; (2) the grams that occur in more than a chosen threshold number of documents are adopted as “style features”; and (3) principal components analysis (PCA) is applied to the frequencies of the style features in T 26 and T 125 to examine whether the translation styles of the two collections are consistent. The statistical analysis shows that the translation styles of the two collections differ significantly. To strengthen this result, we manually checked the style features of the two collections, looking for phrases that differ in wording but share a similar meaning. The manual comparison yielded many examples indicating that the differences in translation style between the two collections are indeed affected by the translators’ word choices. These results indicate that the extant Chinese Madhyama-āgama and Ekottarika-āgama are very probably not the work of the same translator. Because manual comparison requires a large number of man-hours, this study also looks for a way to identify similar phrases automatically, in order to improve research efficiency and to cope with the much larger sets of phrases expected in future work.
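Steps (1) to (3) of the method described above can be illustrated with a rough sketch. It is not the study's implementation: plain character n-grams over a fixed length range stand in for the VL n-gram procedure, whose details are not given here, and the first principal component is found by simple power iteration rather than a statistics package. All function names, parameters, and the toy documents are assumptions made for illustration.

```python
from collections import Counter

def char_ngrams(text, min_n=2, max_n=3):
    """All character n-grams of length min_n..max_n (simplified stand-in for VL n-gram)."""
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

def style_features(docs, threshold, min_n=2, max_n=3):
    """Keep the grams that occur in more than `threshold` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, min_n, max_n)))
    return sorted(g for g, df in doc_freq.items() if df > threshold)

def frequency_matrix(docs, features, min_n=2, max_n=3):
    """One row per document: relative frequency of each feature gram."""
    rows = []
    for doc in docs:
        counts = Counter(char_ngrams(doc, min_n, max_n))
        total = sum(counts.values()) or 1
        rows.append([counts[f] / total for f in features])
    return rows

def first_principal_component(rows, iterations=200):
    """Score of each row on the first principal component (needs >= 2 rows)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Start from the longest centered row so the start vector is not in the null space.
    v = list(max(centered, key=lambda r: sum(x * x for x in r)))
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # no variance at all: every score will be zero
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]

# Toy documents (made up, not taken from T 26 or T 125).
toy_docs = ["aaaa", "aaaa", "bbbb", "bbbb"]
features = style_features(toy_docs, threshold=1)
scores = first_principal_component(frequency_matrix(toy_docs, features))
print(features, scores)
```

If documents from two collections fall on opposite sides of the first component, as the two toy groups do here, that is the kind of separation a style analysis would look for. The sign of a principal component is arbitrary, so only the grouping is meaningful.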
We use the longest common subsequence (LCS) to measure the degree of similarity between two phrases. The experimental results show that, although the LCS measure is not markedly effective, it remains a usable method for comparing large sets of phrases, and its output may contain key clues that scholars can use for further research.
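As an illustration of the LCS measure mentioned above, here is a minimal sketch. The abstract does not say how the LCS length is turned into a similarity score, so the normalization by the longer phrase, the threshold value, and the function names are all assumptions. The example phrases are toy strings chosen for illustration, not results from the study.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_similarity(a, b):
    """LCS length divided by the longer phrase's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def similar_pairs(phrases_a, phrases_b, threshold=0.5):
    """All cross-collection pairs whose similarity reaches the (hypothetical) threshold."""
    pairs = [(a, b, lcs_similarity(a, b)) for a in phrases_a for b in phrases_b]
    return sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])

# Toy phrase lists, for illustration only.
print(similar_pairs(["我聞如是"], ["聞如是", "世尊"]))
```

Because a subsequence need not be contiguous, this score rewards phrases that share characters in the same order even when extra characters are interleaved. That may help explain why it can surface near-variants yet also produce false matches that still need manual review.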