Jun's Pocket Plane: Handy ImageMagick tools for cropping book images prior to OCR

Tuesday, June 10, 2014

Handy ImageMagick tools for cropping book images prior to OCR

It's annoying to edit out page headers and footers that have been automatically OCR'd. One way to avoid this problem is to manually specify the OCR area, but this quickly becomes tedious if you have hundreds of pages to process.

Enter ImageMagick's convert command. Today I will talk about the -chop and -shave options for convert.

These options are a lifesaver when we want to remove unneeded areas (e.g. headers, footers, page numbers) that appear in the same location across multiple pages.

A useful guide to the various image-cropping options available in convert is here:

http://www.imagemagick.org/Usage/crop/

Consider the following scan from a book:

Every page contains a page number at the bottom, and every other page has the book's title written vertically along the right margin. We want to crop all of the book's pages so that the page number and vertical title won't appear in the final image. This will make OCR go much faster, as we won't have to manually select the area to be OCR'ed.

In this particular example, we can use:

convert "스님의_주례사 - 0011.png" -gravity SouthEast -chop 250x200 SE_chopped250x200.png

This removes 250 pixels from the right (East) and 200 pixels from the bottom (South). With the -crop option, the reference point (0, 0) for all pixel calculations is, by default, the top-left corner of the canvas. The -gravity option changes this (quoting the ImageMagick docs):

The direction you choose specifies where to position text or subimages. For example, a gravity of Center forces the text to be centered within the image. By default, the image gravity is NorthWest.

So -gravity SouthEast makes the bottom-right corner of the canvas the reference point (0, 0). The resulting chopped image looks like:

As you can see in the above image, the extraneous text from the right and bottom margins has been cropped out!
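If you want to double-check a chop before running it on a whole book, a rough sanity check like the one below can help. It is only a sketch: the chopped_size helper and the 1000x1400 blank canvas are made up for illustration, and the ImageMagick part is skipped when convert and identify are not installed.

```shell
#!/bin/sh
# Expected size after "-chop CWxCH": CW columns and CH rows are removed,
# so each dimension shrinks by exactly the chop amount.
chopped_size() {
    # usage: chopped_size WIDTH HEIGHT CHOP_W CHOP_H  (hypothetical helper)
    echo "$(( $1 - $3 ))x$(( $2 - $4 ))"
}

# Compare against ImageMagick on a blank 1000x1400 test canvas (assumed size).
if command -v convert >/dev/null 2>&1 && command -v identify >/dev/null 2>&1; then
    tmp=$(mktemp -d)
    convert -size 1000x1400 xc:white "$tmp/page.png"
    convert "$tmp/page.png" -gravity SouthEast -chop 250x200 "$tmp/chopped.png"
    echo "expected: $(chopped_size 1000 1400 250 200)"
    echo "actual:   $(identify -format '%wx%h' "$tmp/chopped.png")"
    rm -rf "$tmp"
fi
```

The two printed sizes should agree; if they don't, the geometry or the gravity is probably not what you intended.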

In other books, however, the location of extraneous text might be different. Let's say you want to remove text from both the top and bottom (or left and right) of the following screencap:

Eliminating the viewing frame can be accomplished with the convert flag -shave, which shaves pixels from the edges of an image (top & bottom, left & right). The arguments of -shave are:

... -shave [numPixelsLeftRightEdges]x[numPixelsTopBottomEdges]

Note that the brackets above should not actually be typed out. So if you wanted to remove 100 pixels from each of the left and right edges, you would pass the following argument to -shave:

... -shave 100x0

If you want to remove 100 pixels from the top and bottom edges:

... -shave 0x100

To remove 100 pixels from both the top & bottom as well as left & right edges:

... -shave 100x100
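Because -shave trims both opposite edges, the image shrinks by twice the number you pass in each direction. Here is a small sketch of that arithmetic; the shaved_size helper and the 1600x1000 input size are assumptions for illustration, not anything ImageMagick provides.

```shell
#!/bin/sh
# -shave XxY trims X pixels from BOTH the left and right edges and Y pixels
# from BOTH the top and bottom edges, so each dimension shrinks by 2x the argument.
shaved_size() {
    # usage: shaved_size WIDTH HEIGHT SHAVE_X SHAVE_Y  (hypothetical helper)
    echo "$(( $1 - 2 * $3 ))x$(( $2 - 2 * $4 ))"
}

# A hypothetical 1600x1000 screencap shaved with 80x50:
shaved_size 1600 1000 80 50   # prints 1440x900
```

This is the main practical difference from -chop, which removes the given amount only once, from the side selected by -gravity.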

The single page with its top and bottom edges removed:

convert escape_from_evil_frame_ex.png -shave 0x50 shaved_0x50.png

Finally, one more example. Say we have the following screencap containing two facing pages:

Let's use the -shave flag to remove extraneous areas from both the top and bottom, left and right to make the image more amenable to OCR.

convert two_page.png -shave 80x50 two_page_shaved_80x50.png

This command shaves 80 pixels from each of the left and right edges and 50 pixels from each of the top and bottom edges, leaving us with the following image:

To run any of the above commands on all the images in a directory, simply invoke convert with a wildcard. For example:

convert *.png -configFlag outputFilename.png

Here -configFlag stands for whichever option you are using (e.g. -shave 80x50). ImageMagick will automatically number the output files: outputFilename-0.png, outputFilename-1.png, ...
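The wildcard form numbers its outputs, so the original page names are lost. If you would rather keep them, one alternative is a plain shell loop that processes one file at a time. This is a sketch, not the only way to do it: the shave_all function, the DRY_RUN switch, and the directory names are all made up for illustration.

```shell
#!/bin/sh
# Apply the same -shave to every PNG in a directory, one file at a time,
# writing each result under its original file name in a separate directory.
# Set DRY_RUN=1 to print the commands instead of running them.
shave_all() {
    # usage: shave_all INPUT_DIR OUTPUT_DIR GEOMETRY   (e.g. 80x50)
    in_dir=$1; out_dir=$2; geometry=$3
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.png; do
        [ -e "$f" ] || continue          # the glob matched no PNG files
        out="$out_dir/$(basename "$f")"
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "convert $f -shave $geometry $out"
        else
            convert "$f" -shave "$geometry" "$out"
        fi
    done
}

# Example with hypothetical directory names:
# DRY_RUN=1 shave_all scans shaved 80x50
```

Running it with DRY_RUN=1 first lets you confirm the file list and geometry before any images are written.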

