Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR.
By: Saman Idrees and Hossein Hassani.
Publisher: MDPI AG, 2021.
Publication Name: Applied Sciences
Applications based on Long Short-Term Memory (LSTM) require large amounts of data for their training. Tesseract LSTM is a popular Optical Character Recognition (OCR) engine that has been trained and used for various languages. However, its training is hindered when the target language is under-resourced. This research suggests a remedy for the problem of scant data in training Tesseract LSTM for a new language by exploiting a training dataset for a language with a similar script. The target of the experiment is Kurdish, a multi-dialect language that is considered less-resourced. We choose Sorani, one of the Kurdish dialects, which is mostly written in Persian-Arabic script. We train Tesseract using an Arabic dataset, and then use a considerably smaller amount of text in Persian-Arabic script to train the engine to recognize Sorani texts. Our dataset is based on a series of court case documents in the Kurdistan Region of Iraq. We also fine-tune the engine using 10 Unikurd fonts. [1]
Click to read the article: Exploiting Script Similarities to Compensate for the Large Amount of Data in Training Tesseract LSTM: Towards Kurdish OCR (https://www.kurdipedia.org/docviewer.aspx?id=445082&document=0001.PDF)