Tuesday, 10 September 2013

Algorithm for separating Japanese characters one by one from an image using OpenCV

I have an application that needs to separate Japanese characters one by one
from an image.
Input: an image with ONE line of Japanese text. It can contain halfwidth
Katakana, halfwidth numbers, fullwidth Katakana, Hiragana and fullwidth
numbers. It may also contain halfwidth or fullwidth English characters
(let's forget about English characters for the moment).
Issue: I can easily separate out the characters by using adaptive
thresholding, dilating and eroding. But there is one big issue.
Some of the Japanese characters have a gap inside the character itself,
like 川 and 休. So simply looking at vertical white gaps doesn't help. Finding
the width doesn't help either because there can be fullwidth characters
(2-byte) or halfwidth characters (1-byte). I seem to need a more
sophisticated way to do this.
Any idea how I should proceed with this? Any idea is a good idea :)
Here are a couple of sample images (the characters circled in red are the
problematic ones):
http://imageshack.us/a/img833/3810/e31z.png
http://imageshack.us/a/img12/2395/7mqn.png
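
To make the problem concrete, here is a minimal sketch of the naive approach, splitting the line wherever a column contains no ink. It is not my actual code: the function names column_ink and split_at_blank_columns are made up for illustration, and the tiny hand-written bitmap stands in for the binary image that would normally come out of the thresholding step (for example OpenCV's cv2.adaptiveThreshold).

```python
def column_ink(binary_rows):
    """Count foreground (non-zero) pixels in each column of a binary image."""
    return [sum(1 for v in col if v) for col in zip(*binary_rows)]


def split_at_blank_columns(ink_per_column):
    """Return (start, end) spans of consecutive non-blank columns (end exclusive)."""
    spans = []
    start = None
    for x, ink in enumerate(ink_per_column):
        if ink and start is None:
            start = x                      # a run of inked columns begins
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column closes the run
            start = None
    if start is not None:                  # run touching the right edge
        spans.append((start, len(ink_per_column)))
    return spans


# Hypothetical 5x11 test line: one solid 4-column glyph, a 2-column gap,
# then a single glyph drawn as three separate vertical strokes.
rows = ["####..#.#.#"] * 5
image = [[1 if c == "#" else 0 for c in r] for r in rows]

spans = split_at_blank_columns(column_ink(image))
print(spans)
```

The line holds two characters, but the split reports four spans because the second glyph has blank columns between its strokes. That is exactly the failure I am seeing on the circled characters.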
