Tuesday, July 22, 2014

How to fix broken CJK filenames extracted from a zip archive created in an MS Windows environment

Most desktop computers run some form of MS Windows, which doesn't use UTF-8 as its native character encoding. This can cause problems for Linux users who receive files from a Windows environment whose names contain East Asian CJK (Chinese, Japanese, Korean) characters. For single files sent as email attachments through Gmail, Google is smart enough to detect which code page the filename is encoded in and convert it to UTF-8 when the attachment is downloaded to a POSIX environment.

For archive files like .zip, however, files that were named with CJK characters in an MS Windows environment will appear as gibberish in a UTF-8 locale.

Rather than booting up a Windows VM just to extract files from an archive, a faster method is to extract the compressed files while maintaining their original filename character encoding.

The following example uses this .zip file, which was created on a Korean version of MS Windows. Korean characters on Windows are encoded using Code Page 949, a superset of EUC-KR, the most widely used legacy character encoding in Korea.

First I will extract the file using 7z from the CLI, but in a modified environment with a different locale, by prefixing the command with env and setting the LANG=... environment variable. This method was first described by developer Allen Choong in this post from 2013, in which he details converting filenames encoded in MS Windows GBK (Code Page 936) Simplified Chinese to UTF-8 after extraction from an archive file.

[archjun@arch Downloads]$ env LANG=C 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/������(����).doc.docx
Extracting  Korean Writers(2009)/������.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578

You can see that the extracted filenames look mangled: they are still stored in their original non-UTF-8 encoding, which a UTF-8 terminal cannot display.
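To make the mangling concrete, here is a small Python sketch (Python is my choice for illustration; the post itself only uses shell tools). It assumes the archive stores the name 편혜영 as CP949 bytes, as described above, and shows why a UTF-8 terminal can only display replacement characters for them.

```python
# Illustration only: the name from this post, encoded the way Korean Windows stores it.
name = "편혜영"
raw = name.encode("cp949")  # the bytes as they sit in the zip archive

# CP949 uses two bytes per Hangul syllable, UTF-8 uses three.
print(len(raw), len(name.encode("utf-8")))

# These bytes are not valid UTF-8, so a strict decode fails...
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)

# ...and a lenient decode yields only U+FFFD replacement characters.
shown = raw.decode("utf-8", errors="replace")

# The bytes themselves are intact: the right code page recovers the name.
recovered = raw.decode("cp949")
print(recovered == name)
```

The point is that nothing has been lost at this stage. The bytes on disk are still the original CP949 bytes, which is exactly what the conversion step needs.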

Note that the LANG variable can also be set to euc-kr or cp949 instead of C, which will likewise maintain the original filename character encoding for archive files created in Korean Windows.

Next we need to convert the gibberish filenames from EUC-KR/CP949 to UTF-8 using convmv, which, according to the description on its man page, "converts filenames from one encoding to another":

[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
mv "/home/archjun/Downloads/Korean Writers(2009)/������(����).doc.docx" "/home/archjun/Downloads/Korean Writers(2009)/편혜영(영문).doc.docx"
mv "/home/archjun/Downloads/Korean Writers(2009)/������.hwp" "/home/archjun/Downloads/Korean Writers(2009)/편혜영.hwp"
Ready!

For the -f (from encoding) flag, you can also use euc-kr and the filename conversion will work just fine. The -r flag tells convmv to work recursively, converting the names of all files in the directory and its sub-directories.

The --notest flag must be added for convmv to actually rename the files; without it, convmv only prints what it would do. As you can see above, the � gibberish characters have been converted to readable Korean.
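The renaming step can be sketched in Python as follows. This is not how convmv is implemented, only a minimal approximation of the idea: rename_to_utf8 is a hypothetical helper of my own, it handles a single directory rather than recursing like -r, and its dry_run parameter loosely mirrors convmv's default test mode. It assumes a Linux filesystem that accepts arbitrary bytes in filenames.

```python
import os
import tempfile


def rename_to_utf8(directory, source_encoding="cp949", dry_run=True):
    """Rename entries whose names are not valid UTF-8 (hypothetical helper).

    Returns a list of (old_name, new_name) pairs as bytes.
    """
    base = os.fsencode(directory)
    planned = []
    # Listing a bytes path returns the raw byte names, undecoded.
    for old in os.listdir(base):
        try:
            old.decode("utf-8")
            continue  # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        new = old.decode(source_encoding).encode("utf-8")
        planned.append((old, new))
        if not dry_run:
            os.rename(os.path.join(base, old), os.path.join(base, new))
    return planned


# Demo in a throwaway directory: create a file with a CP949-encoded name.
with tempfile.TemporaryDirectory() as tmp:
    legacy_name = "편혜영.hwp".encode("cp949")
    open(os.path.join(os.fsencode(tmp), legacy_name), "wb").close()

    preview = rename_to_utf8(tmp)            # dry run, nothing is renamed
    rename_to_utf8(tmp, dry_run=False)       # actually rename
    result = os.listdir(os.fsencode(tmp))

print(result == ["편혜영.hwp".encode("utf-8")])
```

Working with bytes paths throughout is deliberate: it avoids letting Python guess an encoding for names that are, by construction, not in the filesystem's expected encoding.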

In Allen's original post referred to above, he makes the important point that if you just naively extract an archive containing filenames in a non-UTF-8 encoding onto a system with a UTF-8 locale, the gibberish filenames will automatically be encoded as UTF-8 but still be unreadable. If this happens, you will not be able to convert the mangled filenames to UTF-8 because they are in UTF-8 already!

For example,

[archjun@arch Downloads]$ 7z x 편혜영.zip

7-Zip [64] 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)

Processing archive: 편혜영.zip

Extracting  Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Extracting  Korean Writers(2009)/ÆíÇý¿µ.hwp

Everything is Ok

Files: 2
Size:       33991
Compressed: 25578
[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ.hwp
Ready!

In the case above, we didn't override the locale for the extraction, so 7z used the character encoding of our locale, en_US.UTF-8, when writing the filenames.

Because of this, when we try to use convmv to convert from an MS Windows character encoding to UTF-8, convmv tells us that the filenames are already in UTF-8 and therefore cannot be converted!
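The "already UTF-8" trap can be reproduced in a few lines of Python. This sketch assumes the extractor treated each legacy byte as one Latin-1 character, which matches the ÆíÇý¿µ output shown above but may not hold for every tool or version.

```python
original = "편혜영"
raw = original.encode("cp949")  # bytes stored in the archive

# Naive extraction: each legacy byte is read as one Latin-1 character...
mojibake = raw.decode("latin-1")
# ...and that text is then written to disk as UTF-8.
on_disk = mojibake.encode("utf-8")
print(mojibake)

# The stored name is perfectly valid UTF-8, so a check for
# "is this already UTF-8?" succeeds and the conversion is skipped.
is_valid_utf8 = on_disk.decode("utf-8") == mojibake
print(is_valid_utf8)

# Treating the on-disk bytes as CP949 does not bring the Korean back.
wrong = on_disk.decode("cp949", errors="replace")
print(wrong == original)

# Under the Latin-1 assumption, the damage is reversible in principle.
undone = on_disk.decode("utf-8").encode("latin-1").decode("cp949")
print(undone == original)
```

The last step only works when you know which intermediate mapping the extractor applied, so preserving the original bytes at extraction time remains the simpler route.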

The same holds true for other archive extractors like unzip, file-roller, etc. So don't forget to prefix the archive extraction command with env to create a modified environment, setting LANG to C or to the original encoding (euc-jp, euc-kr, shift_jis, gbk, etc.). That way the extracted filenames keep their original character encoding and can then be converted with convmv!

Postscript 2014-12-21:

Once you have converted filenames from a Windows text encoding like euc-kr to UTF-8, you may also need to convert the text inside a plain text file (not a binary like .doc, .hwp, etc.) created in a Windows environment into UTF-8.

The Linux command for converting text within a file to another encoding is iconv. Let's assume we have a file, someText.txt, that was created in Windows and that contains Korean characters encoded in euc-kr. To convert to UTF-8 you can invoke iconv with the following flags:

iconv -c -f euc-kr -t utf8 someText.txt > someTextUTF-8.txt

-c  Silently discard characters that cannot be converted instead of
      terminating when encountering such characters.

-f  from-encoding (input characters)

-t  to-encoding (output characters)

The invocation above reads in someText.txt in euc-kr encoding and redirects output to someTextUTF-8.txt in UTF-8 encoding.
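For comparison, roughly the same conversion can be sketched in Python. convert_file is a hypothetical helper name of my own, and errors="ignore" only approximates iconv's -c flag by dropping bytes that cannot be decoded; the file names are the ones used in the example above, created here in a temporary directory.

```python
import os
import tempfile


def convert_file(src, dst, from_encoding="euc_kr", to_encoding="utf-8"):
    """Re-encode a text file, silently dropping undecodable bytes."""
    with open(src, "rb") as f:
        data = f.read()
    text = data.decode(from_encoding, errors="ignore")
    # Write bytes so that Windows line endings (CRLF) pass through unchanged.
    with open(dst, "wb") as f:
        f.write(text.encode(to_encoding))


# Demo with a throwaway EUC-KR file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "someText.txt")
    dst = os.path.join(tmp, "someTextUTF-8.txt")
    with open(src, "wb") as f:
        f.write("편혜영\r\n".encode("euc_kr"))

    convert_file(src, dst)

    with open(dst, "rb") as f:
        converted = f.read()

print(converted == "편혜영\r\n".encode("utf-8"))
```

Like the iconv invocation, this writes to a new file rather than overwriting the original, so the source stays available if the guessed encoding turns out to be wrong.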