Tuesday, August 26, 2014

Learning Traditional Chinese characters used in Korean (hanja) using a USB writing tablet and an NDS emulator

In my spare time, I have been studying the 214 Kangxi Radicals (한글: 부수 部首) to help learn traditional Chinese characters faster. Thus far I have focused on just being able to recognize and read the radicals, but I have not tried writing them until now.

Some Koreans in their 50s and 60s suggested that I add writing practice to my study regimen, and the online Chinese/Japanese character-writing site Skritter also champions this approach. The nice thing about learning radicals is that they are the building blocks of most Chinese characters - even complex Chinese characters can be broken down into radical components, making memorization that much easier.

If you have a USB tablet that accepts stylus input (I use the Huion H610, which works pretty well in Linux), you don't have to practice writing Chinese characters with ink and paper; Skritter and lots of software titles for the Nintendo DS teach writing Chinese complete with character overlays and stroke-order information. I personally think Skritter is superior to similar offline-only NDS software, but Skritter is relatively expensive at $15/month. Moreover, Skritter only supports Japanese-style Chinese characters (Kanji), Traditional Chinese, and Simplified Chinese, and the pronunciations it teaches are only valid for Chinese and Japanese.

Although Korean uses Traditional Chinese characters, it uses its own special pronunciations for them. There are also some hanja characters that are uniquely Korean. For example, the gugyeol reading marks 丷 (구결자 하, "ha") and 乛 (구결자 야, "ya") are (to my knowledge) used exclusively in Korea.

I thus searched for Nintendo DS titles teaching Korean-style Chinese characters, and came up with three titles:

ROM# 3659 Magic Cheonjamun DS (마법천자문 DS)
ROM# 3867 HanGeom DS (한검 DS)
ROM# 5523 Mabeop Cheonjamun DS2: Choehuui hanjamabeop (마법천자문 DS2: 최후의 한자마법)

Among these three, the last title is the best. The first title is not playable in the DeSmuME emulator (a black rectangle artifact shows up in the bottom screen when running the ROM), and the second title is mostly quiz-based and expects users to already know all ~3,600 common Chinese characters tested in high school.

Mabeop Cheonjamun DS2 has an adventure/story mode, but perhaps because of my age I found it uninteresting. Much more useful is the dictionary mode, which lists all the Chinese characters covered in the official Korean hanja exam, divided into levels 8 ~ 1 (특급). I went through the list to find just the radical characters to practice with.

Here is a video screen capture of Mabeop Cheonjamun DS2 in action:



Monday, August 18, 2014

Fortune-telling programs in Linux: qtarot

Qtarot 0.5.8 is a nice graphical Python program using the Qt framework that displays a tarot layout and explanations for each of the cards in a reading.

Here's a screenshot of a sample reading:


The default layout shown above is "ellipse", and double-clicking on any card will bring up a separate window to the right that explains the card given its orientation (i.e. upright or upside-down) and position.

For someone like me who hasn't memorized all the cards and their interpretations, Qtarot is very user-friendly. There are some tweaks that users must make, however, to get good-looking cards with higher resolutions.

The default cards provided in the Qtarot GitHub repo (/decks/coleman-white) are very low resolution (99 x 170 pixels), which looks pretty horrible in full-screen mode. They are installed in the following local path:

/usr/share/qtarot/decks/coleman-white/

These are the standard Rider-Waite public domain[1] images used in most Universal Tarot decks. There are many higher-resolution versions of these same images available on the Internet, however. I found 350 x 600 pixel images of the Rider-Waite cards at www.learntarot.com. Fortunately, the site uses a predictable numbering scheme for the cards, so I was able to fetch large images of all 78 cards by combining wget with bash brace expansion:

wget http://www.learntarot.com/bigjpgs/maj{00..21}.jpg

The above downloads all 22 major arcana cards; bash expands maj{00..21}.jpg into maj00.jpg through maj21.jpg before handing the URLs to wget.

The numbering scheme for the four suits (pentacles, cups, swords, and wands) is similarly predictable, making it easy to download the rest of the deck.

But Qtarot cannot handle .jpg images, only .png, so you need to batch-convert the JPEGs. ImageMagick to the rescue!

mogrify -format png *.jpg

The above command converts every .jpg in the current directory to .png format (mogrify keeps the original files when changing formats).

Once you have downloaded all the images, you must rename them according to the numbering scheme used by Qtarot (see the sketch after the list below). The numbering scheme is as follows:

0~21  the major arcana (The Fool ~ The World)
22~35 the wands (Ace of Wands ~ King of Wands)
36~49 the cups (Ace of Cups ~ King of Cups)
50~63 the swords (Ace of Swords ~ King of Swords)
64~77 the pentacles (Ace of Pentacles ~ King of Pentacles)
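
As a hypothetical illustration, suppose the converted major arcana files are still named maj00.png through maj21.png (per the wget example above) and that Qtarot expects bare numeric filenames like 0.png through 77.png; a minimal Python rename sketch might then look like this:

import os

# Hypothetical mapping: maj00.png ~ maj21.png -> 0.png ~ 21.png;
# the suit cards would be renamed analogously into 22.png ~ 77.png
for i in range(22):
    src = "maj{:02d}.png".format(i)
    if os.path.exists(src):
        os.rename(src, "{}.png".format(i))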

To save you the trouble of converting and renaming all the files, I encourage you to just download the archive of renamed png files I have made available on Google Drive at this link. Of course, you can use any tarot images you like as long as you convert your images to png and rename the image files according to the numbering scheme above.

Now overwrite the existing low-res images in /usr/share/qtarot/decks/coleman-white/ with your desired hi-res images.

Finally, edit /usr/share/qtarot/decks/coleman-white/deck.ini to make the hi-res cards appear larger:

[Deck Skin]
definition=Rider Waite
author=Pamela Colman Smith
desc=The original illustrations for the most widely-known deck.
width=1.0
height=2.0

The default values of width and height are 0.5 and 1.0, respectively, but that is way too small. I have doubled these values above.

Qtarot hasn't yet been packaged for Ubuntu/Debian distros, but Arch Linux users can install the qtarot-git package from the AUR, maintained by ShadowKyogre, Qtarot's developer.





[1] As of 2012, the images from the Rider-Waite tarot are in the public domain, as 70 years have passed since the death of the original UK copyright holder. http://en.wikipedia.org/wiki/Rider-Waite_tarot_deck#Copyright_status

Thursday, August 7, 2014

HackerRank as a better platform for solving Project Euler problems

Project Euler has a great archive of algorithm-oriented programming problems, but it is starting to look a bit dated as MOOCs featuring online autograders for coding assignments become commonplace and general programming-challenge sites like CodeEval, HackerRank, CheckiO, etc. have online judges built in.

Let's look at Problem 1 from PE:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.

The input domain is sufficiently small for brute-force solutions to work quickly. In pseudocode:

1. Go through all numbers between 1 and 999 inclusive one at a time.
2. If a number is divisible by 3 or 5, put it in a separate list A.
3. Once you have finished going through all the numbers [1, 2, 3, ..., 999], return the sum of the numbers in A.

In Python 3, a minimal sketch along these lines:
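
# Brute force: collect every number below 1000 divisible by 3 or 5,
# then sum the resulting list (list A from the pseudocode)
A = [n for n in range(1, 1000) if n % 3 == 0 or n % 5 == 0]
print(sum(A))  # 233168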



The above solution works fine for such a small domain. Here's the same problem on HackerRank:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below N.
Input Format 
First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.
Output Format 
For each test case, print an integer that denotes the sum of all the multiples of 3 or 5 below N.
Constraints
$$1\le T \le 10^5$$
$$1 \le N \le 10^9$$


Unlike the original problem from Project Euler, the upper limit for N in the HackerRank version is much larger than 1000; it goes up to one billion! For really large N, the naive approach we used before (checking divisibility by 3 or 5 for each number) will no longer run quickly enough, as it has \(O(n)\) time complexity (linear time). Although the naive solution passes a few test cases from HackerRank's autograder, it fails the cases with really large N.

Having larger input ranges is really great, as it forces you to think of a better solution. Fortunately, I had encountered a similar problem, Bullseye, from Google Code Jam 2013 Round 1A, which required knowing how to analytically find the sum of an arithmetic progression.

You might recall from high school math class that if we have a finite sequence with a common difference between terms, e.g. $$\{3, 6, 9, 12, 15, 18\}$$ we can find \(S_n\), the sum of the arithmetic progression, from the following formula:
$$S_n = \frac{(a_1 + a_n)n}{2}$$
where \(a_1\) is the first term and \(a_n\) is the n-th term in the progression. If we plug in \(a_1=3\), \(a_n=18\) from the 6-element sequence above, we get the sum \(\frac{21 \cdot 6}{2}=63\), which we can easily verify by manually adding up all six terms.

Also, the formula for calculating the n-th term in an arithmetic sequence or progression is given by:
$$a_n = a_1 + (n-1)d$$
where \(a_1\) is the first term in the sequence and \(d\) is the constant difference between terms.

Let's say we want to find the number of terms less than 1,000,000 in the arithmetic progression of multiples of 5, i.e. first term 5 and common difference 5. We can find the number of terms as follows:
$$\lfloor\frac{(1000000 - 1)}{5}\rfloor = 199999$$
The bracket-like figures \(\lfloor\) and \(\rfloor\) denote rounding down, so the expression above is doing integer floor division.

We subtract 1 from 1000000 because one million is a multiple of 5, but the terms of the arithmetic progression must be less than one million. Therefore the number of terms in the arithmetic sequence \(\{5, 10, 15, 20, \cdots, 999990, 999995\}\) is 199999.

Using the formulas above, we can now solve each test case of the problem in \(O(1)\) constant time. Here is the new version stated in pseudocode:

1. Given an upper limit lim, find the sum of all numbers less than lim that are divisible by 3 and the sum of all numbers less than lim divisible by 5.

2. Subtract the sum of all numbers less than lim that are divisible by 15, to correct for double counting: multiples of 15 appear both in the progression {3, 6, 9, 12, 15, ..., lim} and in {5, 10, 15, 20, 25, 30, ..., lim}, so the progression {15, 30, 45, 60, 75, ..., lim} must be subtracted once.

3. Return the overall sum

Here is one possible implementation in Python 3 (a sketch that also reads the T test cases in the HackerRank input format described above):
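
import sys

def sum_multiples_below(k, lim):
    # number of terms: largest n with a_n = k * n < lim
    n = (lim - 1) // k
    # S_n = (a_1 + a_n) * n / 2, with a_1 = k and a_n = k * n
    return (k + k * n) * n // 2

def solve(lim):
    # inclusion-exclusion: multiples of 15 are counted under both 3 and 5
    return (sum_multiples_below(3, lim)
            + sum_multiples_below(5, lim)
            - sum_multiples_below(15, lim))

if __name__ == "__main__":
    data = sys.stdin.read().split()
    t = int(data[0])
    for token in data[1:t + 1]:
        print(solve(int(token)))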



Warning: it would not be a good idea to simply copy-and-paste the code above into the HackerRank autograder! The autograder looks for duplicate submissions and flags non-unique solutions (i.e. answers that have already been submitted)!

Monday, August 4, 2014

Web scraping pages containing non-UTF-8 CJK text with BeautifulSoup 4 (뷰티플수프로 텍스트 추출하기 - extracting text with BeautifulSoup)

In the past, I have written some posts in Korean (JS: 오른쪽 마우스 클릭 차단 그만! and JS: 오른쪽 마우스 클릭 차단 Pt II) about how to use Greasemonkey user scripts to copy text, without writing any code yourself, from web pages that have disabled right-click and copy-paste.

The blogging platforms popular in Korea (Naver, Daum/Tistory, etc.) are referred to as "카페" (cafés) and generally block right-click (oncontextmenu) by default. The content I am interested in copying from some of these blog cafés includes public-domain classical Korean poems, for which the copyright obviously doesn't reside with the blog itself. Such content shouldn't be locked up behind Javascript that disables basic browsing features.

The problem with Greasemonkey user scripts like Anti-Disabler, however, is that they don't work on all sites and aren't updated often enough to deal with changes in the anti-copying Javascript plugins from Daum and Naver.

Here's where BeautifulSoup comes in handy. Using bs4 (BeautifulSoup 4.2.0) for Python 3 (which uses UTF-8 by default - great for CJK), I scraped an article from a Korean news site as well as from a locked-down blog. Here's some sample code along those lines (a minimal sketch; the URL below is a placeholder):
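
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Placeholder URL - substitute the article you actually want to scrape
url = "http://news.chosun.com/path/to/article.html"
soup = BeautifulSoup(urlopen(url))
# .get_text() strips tags, scripts, and styling, leaving pure text
print(soup.get_text())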

Korean news sites use tons of banner ads, reminiscent of the densely packed neon lights of entertainment districts in Gangnam, Hong Kong, or Tokyo, along with funky CSS layouts that sometimes make it hard to copy-and-paste an entire article. With Beautiful Soup, we can avoid all the bling and just get pure text.

Here's the .get_text() output of the whole article from Chosun.com:


Beautiful Soup also works great on right-click disabled web pages. Here's a snippet of text from an article about SEO for the Korean search engine Naver:


Note: beware of possible encoding problems when you save .html files locally and try to parse them with BeautifulSoup via open(). Many webpages written in Chinese, Japanese, or Korean (CJK) are still not encoded in UTF-8, instead using older formats such as Shift JIS, GBK, EUC-KR, and the various Windows code pages for Asian languages. These encodings are properly detected and decoded by BeautifulSoup; the problem occurs when your system locale differs from the encoding of the .html file you are trying to save.

For example, my desktop Linux system uses en_US.UTF-8 for its LANG and LC_... settings. Therefore, when I save a text file containing a non-UTF-8 encoding like EUC-KR, it is treated as en_US.UTF-8, the current locale! The problem is that EUC-KR byte sequences are invalid as UTF-8, so when you try to parse the .html file with BeautifulSoup, you will get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x... in position 123: invalid start byte

Since the file has ostensibly been saved as UTF-8, BeautifulSoup expects to find that encoding, but chokes when it finds EUC-KR bytes instead. When opening a URL directly, by contrast, BeautifulSoup doesn't run into this problem of inconsistent encodings.

I have yet to succeed at using BeautifulSoup on an EUC-KR encoded webpage saved locally with that encoding. In Emacs, I specify the encoding for the file to be saved with C-x C-m f RET euc-kr RET, but when I run file --mime localFile.html, the console tells me the file is encoded as Latin-1 (iso-8859-1)!
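
One workaround that may be worth trying (my assumption; it is not something tested in this post) is to read the saved file as raw bytes, so that the locale never gets a chance to reinterpret them, and tell BeautifulSoup the encoding explicitly via its from_encoding parameter:

from bs4 import BeautifulSoup

# Read raw bytes so the en_US.UTF-8 locale cannot mangle the decoding,
# then tell bs4 what encoding the bytes are actually in
with open("localFile.html", "rb") as f:
    soup = BeautifulSoup(f.read(), from_encoding="euc-kr")
print(soup.get_text())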

Tuesday, July 29, 2014

OmegaT will fail to open a project if a git repo is in the project's parent directory or current directory

A few weeks ago, I encountered an issue in which OmegaT hangs when trying to open a project. The log ends with:

15256: Info: Docking Framework version: 2.1.4 
15256: Info: Event: application startup (LOG_INFO_EVENT_APPLICATION_STARTUP)
15256: Info: Click on 'projectMenu' menu item (LOG_MENU_CLICK)
15256: Info: Click on 'projectMenu' menu item (LOG_MENU_CLICK)
15256: Info: Click on 'projectOpenMenuItem' menu item (LOG_MENU_CLICK)
15256: Info: Git 'status' execution start (GIT_START)

This happens because OmegaT checks for team translation projects managed with git. The issue is discussed on the OmegaT support forums in this thread.

The workaround is to remove the offending .git folder that exists in the translation project's parent directory or in the translation project's current folder.

Tuesday, July 22, 2014

How to fix broken CJK filenames extracted from a zip archive created in an MS Windows environment

Most computers run some form of MS Windows, which doesn't natively use UTF-8 for character encoding. This can cause problems for Linux users who have to work with filenames using East Asian CJK (Chinese, Japanese, Korean) characters from a Windows environment. For single files sent as email attachments through Gmail, Google is smart enough to detect which code page the filename is encoded in and convert it to UTF-8 when the attachment is downloaded to a POSIX environment.

For archive files like .zip, however, compressed files named using CJK characters in an MS Windows environment will appear as gibberish in a UTF-8 locale.

Rather than booting up a Windows VM just to extract files from an archive, a faster method is to extract the compressed files while maintaining their original filename character encoding, and then convert the filenames to UTF-8.

The following example will use this .zip file, which was created on a Korean version of MS Windows. Korean characters on Windows are encoded using Code Page 949, which is compatible with EUC-KR, the most widely used character encoding in Korea.

First I will extract the file using 7z from the CLI, but inside a modified environment with a different language encoding, created by using env to set the LANG environment variable. This method was first described by developer Allen Choong in this post from 2013, in which he details converting filenames encoded in MS Windows GBK (Code Page 936) Simplified Chinese to UTF-8 after extraction from an archive file.

[archjun@arch Downloads]$ env LANG=C 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/������(����).doc.docx
Extracting  Korean Writers(2009)/������.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578

You can see that the filenames extracted from the archive are mangled, as they have a non-UTF-8 character encoding.

Note that the LANG variable can also be set to euc-kr or cp949 instead of C, which will likewise preserve the original filename character encoding for archive files created in Korean Windows.

Next we need to convert the gibberish filenames from EUC-KR/CP949 to UTF-8 using convmv, which according to the description on its man page:

converts filenames from one encoding to another 

[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
mv "/home/archjun/Downloads/Korean Writers(2009)/������(����).doc.docx" "/home/archjun/Downloads/Korean Writers(2009)/편혜영(영문).doc.docx"
mv "/home/archjun/Downloads/Korean Writers(2009)/������.hwp" "/home/archjun/Downloads/Korean Writers(2009)/편혜영.hwp"
Ready!

In the -f (from encoding) flag, you can also use euc-kr and the filename conversion will work just fine. The -r flag tells convmv to convert all filenames recursively (all files in the directory and its sub-directories).

The --notest flag must be added for convmv to actually overwrite the existing filenames. As you can see above, the � gibberish characters have been converted to readable Korean.

In Allen's original post referred to above, he makes the important point that if you just naively extract an archive containing filenames in a non-UTF-8 encoding onto a system with a UTF-8 locale, the gibberish filenames will automatically be encoded as UTF-8 but still be unreadable. If this happens, you will not be able to convert the mangled filenames to UTF-8 because they are in UTF-8 already!

For example,

[archjun@arch Downloads]$ 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Extracting  Korean Writers(2009)/ÆíÇý¿µ.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578
[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ.hwp
Ready!

In the case above, we didn't specify a character encoding for the extracted filenames, so 7z defaults to the character encoding of our locale, which is en_US.UTF-8.

Because of this, when we try to use convmv to convert from an MS Windows character encoding to UTF-8, convmv tells us that the filenames are already in UTF-8 and therefore cannot be converted!

The same holds true for other archive extractors like unzip, file-roller, etc. So don't forget to preface the archive extraction command with env to create a modified environment and then set LANG to the proper encoding (whether it is euc-jp, euc-kr, shift_jis, gbk, etc.) so that the extracted filenames' original character encoding will be maintained, thereby enabling conversion with convmv!

Postscript 2014-12-21:

Once you have converted filenames from a Windows text encoding like euc-kr to UTF-8, you may also need to convert the text within a plain text file (not a binary like .doc, .hwp, etc.) created in a Windows environment into UTF-8.

The Linux command for converting text within a file to another encoding is iconv. Let's assume we have a file, someText.txt, that was created in Windows and that contains Korean characters encoded in euc-kr. To convert to UTF-8 you can invoke iconv with the following flags:

iconv -c -f euc-kr -t utf8 someText.txt > someTextUTF-8.txt

-c  Silently discard characters that cannot be converted instead of
      terminating when encountering such characters.

-f  from-encoding (input characters)

-t  to-encoding (output characters)

The invocation above reads in someText.txt in euc-kr encoding and redirects output to someTextUTF-8.txt in UTF-8 encoding.

Tuesday, July 15, 2014

smtp-cli - send email from the command line without a mail server

smtp-cli is a Perl script that enables you to send mail from the command line without having to set up a local mail server. It can interface with webmail SMTP servers like smtp.gmail.com and supports TLS.

A list of optional flags, including --user, --pass, --from, --to, and --subject, can be seen at the bottom of the smtp-cli Perl script on GitHub.

Some caveats when trying to send mail through Gmail:

1. When you use the flags --user username --pass foo, you cannot use your regular Gmail password but must generate an application-specific password instead.

If you don't specify a password beforehand with --pass, you will be interactively prompted for your password as soon as smtp-cli runs:

Enter password for username@smtp.gmail.com :

2. If you want to set up a cron job to run smtp-cli, make sure you specify your app-specific password after the --pass flag, otherwise cron will not be able to handle the interactive prompt.

3. To be able to send multilingual emails, be sure to pass the flag
--charset=UTF-8


Here is an example invocation of smtp-cli that will send an email to my personal gmail account:

[archjun@arch smtp-cli]$ smtp-cli --verbose --host=smtp.gmail.com --enable-auth --user gojun077 --pass 123456789abcdefg --from gojun077@gmail.com --to gojun077@gmail.com --subject "testing smtp-cli script #2" --body-plain="This is the body. 이건 바디다." --charset=UTF-8

When the email arrives in my gmail inbox, it contains the following UTF-8 encoded text in the body:

This is the body. 이건 바디다.


I originally heard about a similar Perl script called sendemail (not to be confused with the mail server sendmail), but it hasn't been updated since 2009 and currently doesn't work due to an SSL bug.

smtp-cli, on the other hand, is comparatively recent, with version 3.6 released July 11, 2013, and even has a package available in the AUR. It works great!