Weakly Supervised Learning Based Historical Chinese Document Recognition
Abstract: Document recognition technology has wide applications and plays an important role in the digitization of historical documents. However, it has received limited attention in the digitization of Chinese historical documents because of its limited performance: it works well only on documents with simple layouts and regular characters. To improve the performance of Chinese historical document recognition and advance its applications, this project systematically studies the theory and key techniques of Chinese historical document recognition and develops effective methods and algorithms. Based on the characteristics of ancient Chinese documents (style variation, frequently touching characters, and variant characters), this project proposes a technical framework for historical Chinese document recognition based on weakly supervised learning. The main contents and innovations are as follows: (1) text extraction from historical documents based on graph-based semi-supervised learning; (2) character classifier adaptation based on deep neural networks; (3) variant Chinese character detection and classifier design based on active learning; (4) text line recognition for historical documents based on weakly supervised learning. The proposed techniques are expected to improve the performance of Chinese historical document recognition and to attract attention in the academic community.
Keywords: weakly supervised learning; offline text recognition; writer adaptation; convolutional neural networks; historical document