Monday, August 4, 2014

Web scraping pages containing non-UTF-8 CJK text with BeautifulSoup 4: extracting text with BeautifulSoup

In the past, I have written some posts in Korean (JS: 오른쪽 마우스 클릭 차단 그만! and JS: 오른쪽 마우스 클릭 차단 Pt II, "JS: Stop blocking right-clicks!" parts I and II) about how to use Greasemonkey user scripts to re-enable copying text from web pages that have disabled right-click and copy-paste.

The blogging platforms popular in Korea (Naver, Daum/Tistory, etc.) are referred to as "카페" (cafés) and generally block right-click (oncontextmenu) by default. The content I am interested in copying from some of these blog cafés includes public-domain classical Korean poems, whose copyright obviously doesn't reside with the blog itself. Such content shouldn't be locked up behind JavaScript that disables basic browser features.

The problem with Greasemonkey user scripts like Anti-Disabler, however, is that they don't work on all sites and aren't updated often enough to keep up with changes in the anti-copying JavaScript plugins from Daum and Naver.

Here's where BeautifulSoup comes in handy. Using bs4 (BeautifulSoup 4.2.0) for Python 3 (which uses UTF-8 by default, great for CJK), I scraped an article from a Korean news site as well as from a locked-down blog. Here's some sample code:
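A minimal sketch of this kind of scrape (assuming bs4 is installed; the markup, tag names, and class names below are made-up placeholders, not the actual markup of the sites I scraped):

```python
from bs4 import BeautifulSoup

# Stand-in for a fetched page: a right-click-blocking script, a banner ad,
# and the article text we actually want (a line of classical Korean verse).
html = """
<html>
<head><script>document.oncontextmenu = function(){return false;};</script></head>
<body>
<div class="ad">배너 광고</div>
<div id="article">가노라 三角山아 다시 보쟈 漢江水야</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Throw away scripts, styles, and ad divs before extracting text.
for tag in soup(["script", "style"]):
    tag.decompose()
for ad in soup.find_all("div", class_="ad"):
    ad.decompose()

text = soup.get_text(strip=True)
print(text)
```

To scrape a live page instead of a string, you would pass the result of `urllib.request.urlopen(url)` straight to `BeautifulSoup()`; bs4 happily accepts file-like objects.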

Korean news sites use tons of banner ads, reminiscent of the densely packed neon signs of entertainment districts in Gangnam, Hong Kong, or Tokyo, along with funky CSS layouts that sometimes make it hard to copy and paste an entire article. With Beautiful Soup, we can skip all the bling and get pure text.

Here's the .get_text() output of the whole article from Chosun.com:


Beautiful Soup also works great on right-click disabled web pages. Here's a snippet of text from an article about SEO for the Korean search engine Naver:


Note: Beware of possible encoding problems when you save .html files locally and then try to parse them with BeautifulSoup via the built-in open() function. Many webpages written in Chinese, Japanese, or Korean (CJK) are still not encoded in UTF-8, instead using older encodings such as Shift JIS, GBK, EUC-KR, and various legacy code pages for Asian languages. BeautifulSoup detects and decodes these encodings properly; the problem occurs when your system locale differs from the encoding of the .html file you are trying to save.

For example, my desktop Linux system uses en_US.UTF-8 for its LANG and LC_... settings. Therefore, when I save a text file whose contents are in a non-UTF-8 encoding like EUC-KR, it gets written out in en_US.UTF-8, the current locale! The problem is that EUC-KR byte sequences are invalid as UTF-8, so when you try to parse the .html file with BeautifulSoup, you get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x... in position 123: invalid start byte

Since Python opens the file using the locale's default UTF-8 encoding, BeautifulSoup expects UTF-8 and chokes when it finds EUC-KR bytes instead. When opening a URL directly, by contrast, BeautifulSoup doesn't run into this mismatch.
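The mismatch is easy to reproduce with nothing but the standard library: take Korean text, encode it as EUC-KR (the bytes as they would sit in the saved file), and try to decode those bytes as UTF-8.

```python
# EUC-KR bytes for Korean text are not valid UTF-8, so decoding
# them with the UTF-8 codec fails just like in the error above.
data = "한글 텍스트".encode("euc-kr")

try:
    data.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```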

I have yet to succeed at using BeautifulSoup on an EUC-KR encoded webpage saved locally with that encoding. In Emacs, I specify the encoding for the file to be saved with C-x C-m f RET euc-kr RET, but when I run file --mime localFile.html, the console tells me the file is encoded as Latin-1 (iso-8859-1)!
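One approach that should sidestep the locale entirely, assuming the bytes on disk really are EUC-KR: open the file in binary mode, so Python never tries to decode it as UTF-8, and tell BeautifulSoup the encoding explicitly with its from_encoding argument. A sketch (the file name and page content here are placeholders):

```python
from bs4 import BeautifulSoup

# Simulate a locally saved EUC-KR page (placeholder name and content).
with open("localFile.html", "wb") as f:
    f.write("<html><body><p>가시리 가시리잇고</p></body></html>".encode("euc-kr"))

# "rb", not "r": binary mode hands BeautifulSoup raw bytes, so the
# en_US.UTF-8 locale never gets a say in decoding them.
with open("localFile.html", "rb") as f:
    soup = BeautifulSoup(f, "html.parser", from_encoding="euc-kr")

print(soup.original_encoding)
print(soup.get_text())
```

Without from_encoding, bs4's UnicodeDammit will still try to sniff the encoding from the raw bytes and the page's meta tags, which often works on its own.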