Jun's Pocket Plane: Open Source OCR on Linux using GUI Frontends for Tesseract

Monday, January 19, 2015

Although I use Linux both at home and at work, for some tasks, like OCR for Korean and Chinese, I have had to rely on proprietary software on Windows (ABBYY FineReader provides excellent recognition results, by the way). This is starting to change, thanks to the Tesseract OCR engine, currently sponsored by Google.

Tesseract has been around for several years, but it wasn't easily accessible before the advent of GUI frontends that make it easy to select the area of an image to be recognized. The two most popular frontends for Tesseract are YAGF (which also works with the Cuneiform OCR engine) and gImageReader. Both can now use the Qt framework: gImageReader used to be based on GTK, but recent versions can use Qt as well.

Screenshot of YAGF

Screenshot of gimagereader

Tesseract's English-language recognition is almost on par with ABBYY FineReader for 300 dpi images, but much worse than FineReader on images below 300 dpi. For non-English text, especially CJK (Chinese, Japanese, Korean) and other Asian scripts, Tesseract still has a long way to go before it matches FineReader.

YAGF doesn't offer Asian languages as an option, even though Tesseract data files exist for many of them. For example, here is a listing of the available tesseract-data packages in Arch Linux:

[archjun@lenovoS310 cam1]$ sudo pacman -Ss tesseract-data
[sudo] password for archjun:
community/tesseract-data-afr 3.02.02-5 (tesseract-data)
Tesseract OCR data (afr)
...
community/tesseract-data-chi_sim 3.02.02-5 (tesseract-data)
Tesseract OCR data (chi_sim)
community/tesseract-data-chi_tra 3.02.02-5 (tesseract-data)
Tesseract OCR data (chi_tra)
...
community/tesseract-data-jpn 3.02.02-5 (tesseract-data)
Tesseract OCR data (jpn)
...
community/tesseract-data-kor 3.02.02-5 (tesseract-data) [installed]
Tesseract OCR data (kor)
...
community/tesseract-data-vie 3.02.02-5 (tesseract-data)
Tesseract OCR data (vie)

Piping the output through wc -l gives a line count of 130; dividing by 2 (two lines per entry) gives 65 language data packages available for Tesseract. As the sample output above shows, Chinese, Japanese, Korean, and Vietnamese are all covered. According to the YAGF developer, Asian-language OCR will be added to the GUI menu after European languages.
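The count can also be taken without dividing by two, by counting only the package lines. This is a minimal sketch: count_packages is a hypothetical helper name, and the sample lines fed to it are copied from the listing above, not a full run.

```shell
#!/bin/sh
# count_packages: hypothetical helper that reads `pacman -Ss`-style output on
# stdin and counts only the "repo/tesseract-data-xxx ..." lines, so the
# description lines never enter the total.
count_packages() {
    grep -c '^[^[:space:]/]*/tesseract-data-'
}

# Demo on two entries taken from the listing above; prints 2.
count_packages <<'EOF'
community/tesseract-data-afr 3.02.02-5 (tesseract-data)
Tesseract OCR data (afr)
community/tesseract-data-kor 3.02.02-5 (tesseract-data) [installed]
Tesseract OCR data (kor)
EOF
```

On an Arch Linux system the same helper could be fed directly, e.g. pacman -Ss tesseract-data | count_packages (searching the package database does not require sudo).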

Fortunately, gImageReader does support OCR for Asian languages, as long as the necessary Tesseract language data has been installed. You may notice that the screenshot of gImageReader shows Korean text being recognized. Unfortunately, Tesseract does a poor job of recognizing Korean. Although I haven't done a meticulous count, I would estimate that the results in the second screenshot above represent a recognition accuracy of maybe 70%. This is much worse than ABBYY FineReader. The tesseract-ocr project page offers some tips for improving OCR accuracy, such as increasing the scan resolution and deskewing pages, but image quality does not seem to be the main problem here: the same scanned image I used to test Tesseract for Korean returned 90%+ OCR accuracy in ABBYY FineReader on Windows.
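To check whether a poor result comes from the engine rather than the frontend, the same image can be run through the tesseract command line directly. This is only a sketch: ocr_korean is a hypothetical wrapper name, the file names in the usage note are placeholders, and it assumes tesseract and the Korean language data (tesseract-data-kor on Arch Linux) are installed.

```shell
#!/bin/sh
# ocr_korean IMAGE OUTBASE
# Hypothetical wrapper: runs tesseract on IMAGE with the Korean language data
# and lets tesseract write the recognized text to OUTBASE.txt.
ocr_korean() {
    # Bail out early if tesseract is not available on this machine.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found" >&2
        return 127
    fi
    tesseract "$1" "$2" -l kor
}
```

For example, ocr_korean scan.png result would leave the recognized text in result.txt, which can then be compared against the output shown in the GUI.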

My conclusion: circa Jan. 2015, tesseract is good for English, not so good for Hangul/Korean.
