Tuesday, June 10, 2014

Handy ImageMagick tools for cropping book images prior to OCR

It's annoying to edit out page headers and footers that have been automatically OCR'd. One way to avoid this problem is to manually specify the OCR area, but this quickly becomes tedious if you have hundreds of pages to process.

Enter ImageMagick's convert command. Today I will talk about the -chop and -shave config flags for convert.

These flags are a lifesaver when we want to remove unneeded areas (e.g. headers, footers, page numbers) that appear in the same location on multiple pages.

A useful resource for the various image-cropping options available in convert can be found here:

http://www.imagemagick.org/Usage/crop/

Consider the following scan from a book:


Every page has a page number at the bottom, and every other page has the book's title written vertically along the right margin. We want to crop all of the book's pages so that the page number and vertical title don't appear in the final image. This will make OCR go much faster, as we won't have to manually select which area needs to be OCR'ed.
In this particular example, we can use:

convert "스님의_주례사 - 0011.png" -gravity SouthEast -chop 250x200 SE_chopped250x200.png

This removes 250 pixels from the right (East) and 200 pixels from the bottom (South). By default, as with the -crop config flag, the reference point (0, 0) for all pixel calculations is the top-leftmost corner of the canvas. The -gravity flag changes this, however (quote from the ImageMagick docs):

The direction you choose specifies where to position text or subimages. For example, a gravity of Center forces the text to be centered within the image. By default, the image gravity is NorthWest.

So -gravity SouthEast will make the reference point (0, 0) the bottom-rightmost corner of the canvas. Now the resulting chopped image looks like:



As you can see in the above image, the extraneous text from the right and bottom margins has been cropped out!
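To sanity-check a chop before running it on a whole book, it helps to work out what size the result should be. The following is only a small sketch of that arithmetic: chopped_size is a helper name I made up, and the 1700x2200 page size is a hypothetical example, not the size of the scan above.

```shell
#!/bin/sh
# Compute the image size left after "-gravity SouthEast -chop XxY".
# -chop removes X whole columns and Y whole rows, so each dimension
# simply shrinks by the chopped amount.
# Usage: chopped_size WIDTH HEIGHT CHOP_X CHOP_Y
chopped_size() {
    echo "$(( $1 - $3 ))x$(( $2 - $4 ))"
}

# Hypothetical 1700x2200 scan, chopped by 250x200 as in the command above.
chopped_size 1700 2200 250 200   # prints 1450x2000
```

You can compare this against the real output by running identify -format '%wx%h\n' on the chopped file.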

In other books, however, the location of extraneous text might be different. Let's say you want to remove text from both the top and bottom (or left and right) of the following screencap:

Eliminating the viewing frame can be accomplished with the convert flag -shave, which shaves pixels from the edges of an image (top & bottom, left & right). The arguments of -shave are:

... -shave [numPixelsLeftRightEdges]x[numPixelsTopBottomEdges]

Note that the brackets above should not actually be typed out. So if you wanted to remove 100 pixels from the left and right edges, you would pass the following arguments to -shave:

... -shave 100x0

If you want to remove 100 pixels from the top and bottom edges:

... -shave 0x100

To remove 100 pixels from both the top & bottom as well as left & right edges:

... -shave 100x100
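Since -shave takes the given amount off both opposing edges, the image shrinks by twice each number. As a rough illustration (shave_geometry is a hypothetical helper, and the 1000x800 size is made up), this sketch prints the -crop geometry that should select the same region as a given -shave:

```shell
#!/bin/sh
# Print the -crop geometry that keeps the same region as "-shave XxY".
# The width shrinks by 2*X and the height by 2*Y; the kept region
# starts X pixels from the left edge and Y pixels from the top edge.
# Usage: shave_geometry WIDTH HEIGHT SHAVE_X SHAVE_Y
shave_geometry() {
    new_w=$(( $1 - 2 * $3 ))
    new_h=$(( $2 - 2 * $4 ))
    echo "${new_w}x${new_h}+$3+$4"
}

# Hypothetical 1000x800 image with 100 pixels shaved from every edge.
shave_geometry 1000 800 100 100   # prints 800x600+100+100
```

So on that hypothetical image, -shave 100x100 and -crop 800x600+100+100 (with the default NorthWest gravity) should keep the same pixels; -shave just saves you the arithmetic.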

The single page with its top and bottom edges removed:

convert escape_from_evil_frame_ex.png -shave 0x50 shaved_0x50.png


Finally, one more example. Say we have the following screencap containing two facing pages:


Let's use the -shave flag to remove extraneous areas from both the top and bottom, left and right to make the image more amenable to OCR.

convert two_page.png -shave 80x50 two_page_shaved_80x50.png

This command shaves 80 pixels from both the left and right edges and 50 pixels from both the top and bottom edges, leaving us with the following image:


To run any of the above commands on all the images in a directory, simply invoke convert with a wildcard. For example:

convert *.png -configFlag outputFilename.png

ImageMagick will automatically number the output files: outputFilename-0.png, outputFilename-1.png...
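The wildcard form replaces the original file names with numbered ones. If you would rather keep each page's name, one option is to loop over the files and call convert once per image. This is just a sketch under my own assumptions: batch_convert, the cropped output directory, and the page_00N.png names are all hypothetical, and the DRY_RUN switch is something I added so you can preview the commands before touching any files.

```shell
#!/bin/sh
# Apply one convert flag to each file separately, keeping the file names.
# Usage: batch_convert OUT_DIR FLAG GEOMETRY FILE...
# Set DRY_RUN=1 to print the commands instead of running them.
batch_convert() {
    out_dir=$1 flag=$2 geometry=$3
    shift 3
    # Only create the output directory when we are really converting.
    [ -n "${DRY_RUN:-}" ] || mkdir -p "$out_dir"
    for f in "$@"; do
        target="$out_dir/$(basename "$f")"
        if [ -n "${DRY_RUN:-}" ]; then
            printf 'convert %s %s %s %s\n' "$f" "$flag" "$geometry" "$target"
        else
            convert "$f" "$flag" "$geometry" "$target"
        fi
    done
}

# Preview the commands for two hypothetical page scans.
DRY_RUN=1 batch_convert cropped -shave 80x50 page_001.png page_002.png
```

Once the preview looks right, drop DRY_RUN=1 and pass *.png as the file list. The originals stay untouched, and the cropped copies land in the output directory under the same names.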