2014년 8월 26일 화요일

Learning Traditional Chinese characters used in Korean (hanja) using a USB writing tablet and a NDS Emulator

In my spare time, I have been studying the 214 Kangxi Radicals (한글: 부수 部首) to help learn traditional Chinese characters faster. Thus far I have focused on just being able to recognize and read the radicals, but I have not tried writing them until now.

Some Koreans in their 50's and 60's suggested that I add writing practice to my study regimen, and the online Chinese/Japanese character writing site Skritter also champions this approach. The nice thing about learning radicals is that they are the building blocks of most Chinese characters - even complex Chinese characters can be broken down into radical components, making memorization that much easier.

If you have a USB tablet that accepts stylus input (I use the Huion H610, which works pretty well in Linux), you don't have to practice writing Chinese characters with ink and paper; Skritter and lots of software titles for the Nintendo DS teach writing Chinese complete with character overlays and stroke order information. I personally think Skritter is superior to similar offline-only NDS software, but Skritter is relatively expensive, at $15/month. Also, Skritter only supports Japanese-style Chinese characters (Kanji), Traditional Chinese, and Simplified Chinese. Also the pronunciation that is taught is only valid for Chinese and Japanese.

Although Korean uses Traditional Chinese characters, it uses its own special pronunciation for them. There are also some hanja characters that are uniquely Korean. For example, the pronunciation marks 丷 (구결자 하) and 乛 (구결자 야) are (to my knowledge) used exclusively in Korea.

I thus searched for Nintendo DS titles teaching Korean-style Chinese characters, and came up with 3 titles:

ROM# 3659 Magic Cheonjamun DS (마법천자문 DS)
ROM# 3867 HanGeom DS (한검 DS)
ROM# 5523 Mabeop Cheonjamun DS2: Choehuui hanjamabeop (마법천자문 DS2: 최후의 한자마법)

Among these three, the last title is the best. The first title is not playable in the Desmume emulator (a black rectangle artifact shows up in the bottom screen of the emulator when running the ROM), the second title is mostly quiz-based and expect users to already know all ~3600 common Chinese characters tested in high school.

There is an adventure/story mode, but perhaps because of my age I found it to be uninteresting. Much more useful is the dictionary mode which has a list of all the Chinese characters covered in the official Korean exam divided into levels 8 ~ 1 (특급). I went through the list to find only the radical characters to practice with.

Here is a video screen capture of Mabeop Cheonjamun DS2 in action:



2014년 8월 18일 월요일

Fortune-telling programs in Linux: qtarot

Qtarot 0.5.8 is a nice graphical Python program using the QT framework that displays a tarot layout and explanations for each of the cards in a reading.

Here's a screenshot of a sample reading:


The default layout shown above is "ellipse" and double clicking on any card will bring up a separate window to the right which gives an explanation of the card given its orientation (i.e. right side up, upside-down) and position.

For someone like me who hasn't memorized all the cards and their interpretations, Qtarot is very user-friendly. There are some tweaks that users must make, however, to get good-looking cards with higher resolutions.

The default cards provided by the Qtarot github repo (/decks/coleman-white) are very low resolution, 99 x 170 pixels, which look pretty horrible in full screen mode. They are installed in the following local path:

/usr/share/qtarot/decks/coleman-white/

These are the standard Rider-Waite public domain1 images used on most Universal Tarot decks. There are many higher-resolution versions of these same images available on the Internet, however. I found 350 x 600 pixel images of the Rider-Waite cards from the site www.learntarot.com. Fortunately, the site uses a predictable numbering scheme for the cards, so I was able to get large images of all 78 cards using batch download flags for wget:

wget http://www.learntarot.com/bigjpgs/maj{00..21}.jpg

The above is an example of downloading all 22 major arcana cards using wget's batch mode.

The numbering scheme for the four suits pentacles, cups, swords, and wands is similarly predictable, making it easy to download the rest of the deck.

But Qtarot cannot handle .jpg images, only .png, so you need to batch convert the jpegs. ImageMagick to the rescue!

mogrify -format png *.jpg

This above command will convert all jpegs to png format.

Once you have downloaded all the images, you must rename them according to the numbering scheme used by Qtarot. The numbering scheme is as follows:

0~21  the major arcana (The Fool ~ The World)
22~35 the wands (Ace of Wands ~ King of Wands)
36~49 the cups (Ace of Cups ~ King of Cups)
50~63 the swords (Ace of Swords ~ King of Swords)
64~77 the pentacles (Ace of Pentacles ~ King of Pentacles)

To save you the trouble of converting and renaming all the files, I encourage you to just download the archive of renamed png files I have made available on Google Drive at this link. Of course, you can use any tarot images you like as long as you convert your images to png and rename the image files according to the numbering scheme above.

Now overwrite the existing low-res images in /usr/share/qtarot/decks/coleman-white/ with your desired hi-res images.

Finally, edit /usr/share/qtarot/decks/coleman-white/deck.ini to make the hi-res cards appear larger:

[Deck Skin]
definition=Rider Waite
author=Pamela Colman Smith
desc=The original illustrations for the most widely-known deck.
width=1.0
height=2.0

The default values of width and height are 0.5 and 1.0, respectively, but that is way too small. I have doubled these values above.

Qtarot hasn't yet been packaged for Ubuntu/Debian distros, but the package qtarot-git in the AUR for Archlinux users is available and maintained by ShadowKyogre, Qtarot's developer.





1As of 2012, the images from the Rider-Waite tarot are in the public domain as 70 years has passed since the death of the original UK copyright holder. http://en.wikipedia.org/wiki/Rider-Waite_tarot_deck#Copyright_status

2014년 8월 7일 목요일

HackerRank as a better platform for solving Euler Project problems

Project Euler has a great archive of algorithm-oriented programming problems, but Project Euler is started to look a bit dated as MOOCs featuring online autograders for coding assignments become commonplace and general programming challenge sites like CodeEval, HackerRank, checkIO, etc. have online judges built-in.

Let's look at Problem 1 from PE:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below 1000.

The input domain is sufficiently small for brute-force solutions to work quickly. In Pseudocode:

1. Go through all numbers between 1 and 999 inclusive one at a time.
2. If a number is divisible by 3 or 5, put it in a separate list A.
3. Once you have finished going through all the numbers [1, 2, 3 ..., 999], return the sum of A.

In Python 3:



The above solution works fine for such a small domain. Here's the same problem on HackerRank:

If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23.
Find the sum of all the multiples of 3 or 5 below N.
Input Format 
First line contains T that denotes the number of test cases. This is followed by T lines, each containing an integer, N.
Output Format 
For each test case, print an integer that denotes the sum of all the multiples of 3 or 5 below N.
Constraints
$$1\le T \le 10^5$$
$$1 \le N \le 10^9$$


Unlike the original problem from the Euler Project, the upper limit for N in the Hacker Rank version is much larger than 1000; it goes up to one billion! For really large N the naive approach we used before (checking divisibility by 3 or 5 for each number) will no longer work quickly enough as it has \(O(n)\) time complexity (linear time). Although the naive solution passes a few test cases from HackerRank's autograder, it fails cases with really large N.

Having larger input ranges is really great as it forces you to think of a better solution. Fortunately I had encountered a similar problem, Bullseye, from Google Code Jam 2013 Round 1A that required knowledge of how to analytically find the sum of an arithmetic progression.

You might recall from high school math class that if we have a finite sequence with a common difference between terms, i.e. $$\{3, 6, 9, 12, 15, 18\}$$ we can find \(S_n\), the sum of arithmetic progression, from the following formula:
$$S_n = \frac{(a_1 + a_n)n}{2}$$
where \(a_1\) is the first term and \(a_n\) is the n-th term in the progression. If we plug in \(a_1=3\), \(a_n=18\) from the 6-element sequence above, we get the sum \(\frac{21 \cdot 6}{2}=63\), which we can easily verify by manually adding up all six terms.

Also, the formula for calculating the n-th term in an arithmetic sequence or progression is given by:
$$a_n = a_1 + (n-1)d$$
where \(a_1\) is the first term in the sequence and \(d\) is the constant difference between terms.

Let's say we want to find the number of terms less than 1,000,000 in the arithmetic progression where the constant difference between terms is 5. We can find the number of terms as follows:
$$\lfloor\frac{(1000000 - 1)}{5}\rfloor = 199999$$
The bracket-like figures \(\lfloor\) and \(\rfloor\) denote rounding down, so the expression above is doing integer floor division.

We subtract 1 from 1000000 because one million is a multiple of 5, but the terms of the arithmetic progression must be less than one million. Therefore the number of terms in the arithmetic sequence \(\{5, 10, 15, 20, \cdots, 999990, 999995\}\) is 199999.

Using the formulas above, we can now solve the original problem in \(O(1)\) constant time. Here is the new version stated in Pseudocode:

1. Given an upper limit lim, find the sum of all numbers less than lim that are divisible by 3 and the sum of all numbers less than lim divisible by 5.

2. Subtract the sum of all numbers less than lim that are divisible by 15 (to account for double counting; i.e. multiples of 15 will be included in the arithmetic progression {3, 6, 9, 12, 15, ... 30, ..., lim} as well as in {5, 10, 15, 20, 25, 30, ..., lim} so we need to subtract the arithmetic progression {15, 30, 45, 60, 75, ..., lim} to remedy this)

3. Return the overall sum

Here is the implementation in Python 3:



Warning: It would not be a good idea to simply copy-and-paste the code above into the HackerRank autograder! The autograder looks for duplicate submissions and red flags non-unique solutions (i.e. answers that have already been submitted)!

2014년 8월 4일 월요일

Webscraping pages containing non-UTF-8 CJK text with BeautifulSoup 4 뷰티플수프로 텍스트 추출하기

In the past, I have written some posts (JS: 오른쪽 마우스 클릭 차단 그만! and JS: 오른쪽 마우스 클릭 차단 Pt II) in Korean about how to apply Greasemonkey user scripts to non-programmatically copy text from web pages which have disabled right-click and copy-paste.

The blogging platforms popular in Korea (Naver, Daum/Tistory, etc) are referred to as "카페" (cafés) and generally block right-click (oncontextmenu) by default. The content I am interested in copying from some of these blog cafés includes public-domain classical Korean poems for which the copyright obviously doesn't reside with the blog itself. Such content shouldn't be locked up behind Javascript that disables basic browsing features.

The problem with Greasemonkey user scripts like Anti-Disabler, however, is that they don't work on all sites and aren't updated often enough to deal with changes in the anti-copying Javascript plugins from Daum and Naver.

Here's where BeautifulSoup comes in handy. Using bs4 (BeautifulSoup 4.2.0) for Python 3 (which uses UTF-8 by default, great for CJK), I scraped an article from a Korean news site as well as from a locked-down blog. Here's some sample code:

Korean news sites use tons of banner ads reminiscent of densely packed neon lights from entertainment districts in Gangnam, Hong Kong, or Tokyo along with funky CSS layout that sometimes make it hard to copy-and-paste an entire article. With Beautiful Soup, we can avoid all the bling and just get pure text.

Here's the .get_text() output of the whole article from Chosun.com:


Beautiful Soup also works great on right-click disabled web pages. Here's a snippet of text from an article about SEO for the Korean search engine Naver:


Note: Beware of possible encoding problems when you save .html files locally and try to parse them with BeautifulSoup using the open() method. Many webpages written in Chinese Japanese Korean (CJK) are still not encoded in UTF-8, instead using older formats such as SHIFT JIS, GBK, EUC-KR, and various Code Pages for Asian languages. These encodings are properly detected and decoded by BeautifulSoup, but the problem occurs when your system locale differs from the encoding of the .html file you are trying to save.

For example, my desktop Linux system uses en_US.UTF-8 for LANG and LC_... settings. Therefore when I save a text file with a non-UTF-8 encoding like EUC-KR, it is automatically saved as en_US.UTF-8, the current locale! The problem is that the EUC-KR encodings are invalid as UTF-8, so when you try to parse the .html file with BeautifulSoup, you will get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x0x... in position 123: invalid start byte

Since the file has been saved as UTF-8, BeautifulSoup expects to find that encoding, but chokes when it finds EUC-KR instead. When opening a URL, by contrast, BeautifulSoup doesn't run into this problem of inconsistent encodings.

I have yet to succeed at using BeautifulSoup on a EUC-KR encoded webpage saved locally with that encoding. In Emacs, I specify the encoding for the file to be saved with C-x C-m f RET euc-kr RET but when I run file --mime localFile.html the console tells me the file is encoded as Latin-1 iso-8859-1!