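In addition to automatically generating multiple-version data from the original XML files for browsing, the system has two main features: "term analysis" and "quotation search". Term analysis uses mutual information to compute the similarity between features, extracts the low-frequency terms, and applies single-link clustering so that terms automatically cluster into related sets. Quotation search combines two methods, retrieval (computing the similarity between the query and a document) and text search (string matching): it first considers the similarity between the document and the query string, and then the order in which the query's keywords appear (term sequence). Its purpose is to find, in a fault-tolerant way, an incomplete passage quoted in an ancient document (for example, one with missing, erroneous, or extra characters).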
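The term-analysis step can be illustrated with a minimal sketch. It assumes that mutual information is estimated as pointwise mutual information over document-level co-occurrence and that single-link clustering is cut at a fixed threshold; the function names, the threshold, and the toy data are hypothetical and not taken from the system itself, and the frequency-based selection of terms is left out.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(documents):
    """Pointwise mutual information of each co-occurring term pair.

    Assumption: probabilities are estimated from document frequencies,
    i.e. PMI(a, b) = log2(P(a, b) / (P(a) * P(b))).
    """
    n = len(documents)
    term_df = Counter()
    pair_df = Counter()
    for doc in documents:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))
    return {
        (a, b): math.log2(joint * n / (term_df[a] * term_df[b]))
        for (a, b), joint in pair_df.items()
    }


def single_link_clusters(terms, scores, threshold):
    """Single-link clustering cut at a fixed similarity threshold.

    Two terms land in the same cluster when a chain of pairs, each with
    score >= threshold, connects them (union-find over those pairs).
    """
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy corpus: each document is a list of terms.
docs = [["sutra", "monk"], ["sutra", "monk"], ["river", "boat"], ["river", "boat"]]
vocabulary = sorted({t for d in docs for t in d})
print(single_link_clusters(vocabulary, pmi_scores(docs), threshold=0.5))
```

Cutting at a threshold is only one way to stop single-link merging; a full agglomerative run that records the merge order would serve equally well if a hierarchy is needed.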
Ancient books and articles commonly exist in multiple versions, for a variety of reasons. For content experts who study these materials, comparing different versions is an important research task and may yield important insights. Since a large volume of ancient books and articles has been digitized, modern information processing technologies should be employed to support the work of content experts. This thesis discusses the design of a browsing and search system for multiple-version ancient materials. The system facilitates not only the browsing of multiple-version materials but also the search for imprecise quotations. Imprecise quotation is an interesting issue because quotations in ancient books and articles are often not explicitly marked and may differ from the original by a few terms or sentences. This thesis employs mutual information and approximate string matching to tackle this problem.
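A fault-tolerant quotation search of this kind can be sketched in two stages. This sketch assumes that terms are single characters, that similarity is cosine similarity over character counts, and that term order is scored by the longest common subsequence; the function names and the `min_similarity` cutoff are hypothetical, and the thesis may define both scores differently.

```python
import math
from collections import Counter


def cosine_similarity(query, passage):
    """Cosine similarity between the character-count vectors of two strings."""
    q, p = Counter(query), Counter(passage)
    dot = sum(q[ch] * p[ch] for ch in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in p.values())
    )
    return dot / norm if norm else 0.0


def sequence_score(query, passage):
    """Fraction of the query's characters that appear in the passage in order.

    Computed as the longest-common-subsequence length divided by the
    query length, so missing, wrong, or extra characters only lower the
    score instead of ruling the passage out.
    """
    if not query:
        return 0.0
    prev = [0] * (len(passage) + 1)
    for qc in query:
        cur = [0]
        for j, pc in enumerate(passage, 1):
            cur.append(prev[j - 1] + 1 if qc == pc else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(query)


def search_quotation(query, passages, min_similarity=0.5):
    """Keep passages similar enough to the query, then rank them by term order."""
    candidates = []
    for passage in passages:
        similarity = cosine_similarity(query, passage)
        if similarity >= min_similarity:  # stage 1: retrieval-style filter
            candidates.append((sequence_score(query, passage), similarity, passage))
    # Stage 2: prefer passages that preserve the query's character order.
    candidates.sort(key=lambda item: (-item[0], -item[1]))
    return [passage for _, _, passage in candidates]


# Hypothetical example: the query drops one character of the first passage.
print(search_quotation("abde", ["abcde", "edcba", "xyz"]))
```

Filtering on similarity first keeps the more expensive order comparison to a short candidate list, which is why the two scores are applied in sequence rather than merged into one.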