一種多模型融合的中文古籍OCR後處理方法=A Post-OCR Method of Multi-Model Ensemble for Chinese Ancient Scriptures

Author

釋賢度 (著)=Shih, Hsien-du (au.)

Source Title

數位典藏與數位人文=Journal of Digital Archives and Digital Humanities

Volume/Issue

n.11

Publication Date

2023.04

Pages

83 - 104

Publisher

臺灣數位人文學會=Taiwanese Association for Digital Humanities

Publisher URL

https://tadh.org.tw/

Place of Publication

臺北市, 臺灣 [Taipei shih, Taiwan]

Resource Type

期刊論文=Journal Article

Language

中文=Chinese

Keywords

post-OCR; 古籍=Ancient Scriptures; 模型融合=model ensemble; 版面分析=layout analysis; 深度學習=deep learning

Abstract

本文提出一種多模型融合的OCR後處理方法，採用獨特的版面分析和對齊算法，整合了整頁檢測模型、字識別模型、列識別模型與語言預訓練模型等深度學習模型，實現了超越單一模型的效果。全文錯誤率達到1.64%，僅為單一模型平均錯誤率的23%。在各類常規古籍版式場景中，該方法具有較好的泛用性。

This paper proposes a multi-model ensemble method for OCR post-processing. It uses a distinctive layout analysis and alignment algorithm to integrate several types of deep learning models, including a full-page character detection model, a character recognition model, a line recognition model, and a pre-trained language model, and it achieves results beyond those of any single model. The full-text error rate is 1.64%, only 23% of the average error rate of the individual models. The method generalizes well across the common layouts of ancient Chinese books.
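The figures in the abstract imply an average single-model error rate of roughly 1.64% / 0.23 ≈ 7.1%. The sketch below is an illustration only, not the paper's algorithm. It assumes the error rate is a character-level edit distance divided by the reference length (the abstract does not say how it is computed), and it replaces the paper's layout analysis and alignment with the simplifying assumption that all model outputs are already aligned to the same length. The function names and the sample strings are hypothetical.

```python
from collections import Counter


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (two-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def majority_vote(candidates: list[str]) -> str:
    """Fuse pre-aligned, equal-length outputs by per-position majority vote.

    Assumption: alignment has already been done. Ties go to the character
    seen first, i.e. to the earliest candidate in the list.
    """
    if len({len(c) for c in candidates}) != 1:
        raise ValueError("candidates must be aligned to the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*candidates))


# Average single-model error rate implied by the abstract's two figures.
implied_single_model_rate = 1.64 / 0.23  # in percent

# Hypothetical outputs from three models, each with one different mistake.
reference = "如是我聞一時佛在"
outputs = ["如是我間一時佛在", "如是我聞一時佛住", "如昰我聞一時佛在"]
fused = majority_vote(outputs)

print(f"implied single-model error rate: {implied_single_model_rate:.2f}%")
print([error_rate(reference, o) for o in outputs], error_rate(reference, fused))
```

Because each hypothetical model errs at a different position, the vote recovers the reference text here. Real outputs differ in length and segmentation, which is why an alignment step has to come before any fusion of this kind.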