Most desktop computers run some version of MS Windows, which doesn't natively use UTF-8 for character encoding. This can cause problems for Linux users who have to work with files that were named using East Asian CJK (Chinese, Japanese, Korean) characters in a Windows environment. For single files sent as email attachments through Gmail, Google is smart enough to detect which code page the filename is encoded in and convert it to UTF-8 when the attachment is downloaded to a POSIX environment.
For archive files like .zip, however, compressed files that were named using CJK characters in an MS Windows environment will appear as gibberish in a UTF-8 locale.
Rather than booting up a Windows VM just to extract files from an archive, a faster method is to extract the compressed files while maintaining their original filename character encoding.
The following example uses this .zip file, which was created on a Korean version of MS Windows. Korean characters on Windows are encoded using Code Page 949, which is backward compatible with EUC-KR, the most widely used character encoding in Korea.
First I will extract the archive using 7z from the CLI, but in a modified environment with a different language setting, created by running the command through env with a LANG=... variable assignment. This method was first described by developer Allen Choong in this post from 2013, in which he details converting filenames encoded in MS Windows GBK (Code Page 936) Simplified Chinese to UTF-8 after extraction from an archive file.
[archjun@arch Downloads]$ env LANG=C 7z x 편혜영.zip
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=C,Utf16=off,HugeFiles=on,2 CPUs)
Processing archive: 편혜영.zip
Extracting Korean Writers(2009)/������(����).doc.docx
Extracting Korean Writers(2009)/������.hwp
Everything is Ok
Files: 2
Size: 33991
Compressed: 25578
You can see that the filenames extracted from the archive look mangled: they are still in their original non-UTF-8 encoding, which a UTF-8 terminal cannot display.
Note that LANG can also be set to euc-kr or cp949 instead of C; either value will also maintain the original filename character encoding for archive files created in Korean Windows.
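To see why the names look mangled, it can help to look at the raw bytes. The sketch below is only an illustration: the byte values (C6 ED C7 FD BF B5 for 편혜영) are read off the Latin-1 rendering of the same name shown later in this post, raw_name is a made-up helper, and an iconv that knows EUC-KR is assumed (glibc's does).

```shell
# The six bytes stored in the archive for the name 편혜영, written as octal
# escapes so that printf behaves the same in any POSIX shell.
raw_name() { printf '\306\355\307\375\277\265'; }

# Dump the bytes in hex: EUC-KR/CP949 uses two bytes per Hangul syllable.
raw_name | od -An -tx1

# Decoded as EUC-KR the bytes form the readable name. A UTF-8 terminal that
# is handed the raw bytes finds no valid sequence and shows � instead.
decoded=$(raw_name | iconv -f EUC-KR -t UTF-8)
printf '%s\n' "$decoded"
```

Extracting with LANG=C leaves exactly these bytes on disk as the filename, which is what the conversion step needs.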
Next we need to convert the gibberish filenames from EUC-KR/CP949 to UTF-8 using convmv, which, according to the description on its man page, "converts filenames from one encoding to another".
[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
mv "/home/archjun/Downloads/Korean Writers(2009)/������(����).doc.docx" "/home/archjun/Downloads/Korean Writers(2009)/편혜영(영문).doc.docx"
mv "/home/archjun/Downloads/Korean Writers(2009)/������.hwp" "/home/archjun/Downloads/Korean Writers(2009)/편혜영.hwp"
Ready!
For the -f (from encoding) flag you can also use euc-kr, and the filename conversion will work just fine. The -t flag gives the target encoding. The -r flag tells convmv to work recursively, converting the names of all files in the directory and its sub-directories.
The --notest flag must be added for convmv to actually rename the files; without it, convmv only prints what it would do. As you can see above, the � gibberish characters have been converted to readable Korean.
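For a rough idea of what the rename amounts to for a single file, here is a minimal sketch built from iconv and mv. It is not how convmv is implemented, and rename_to_utf8 is a hypothetical helper name. It assumes a Linux filesystem that accepts arbitrary bytes in filenames and an iconv that knows the source encoding.

```shell
# Hypothetical helper, not part of convmv: rename one file from a legacy
# encoding (default CP949) to UTF-8, touching only the last path component.
rename_to_utf8() {
    path=$1
    from=${2:-CP949}
    dir=$(dirname -- "$path")
    old=$(basename -- "$path")
    # iconv exits non-zero if the name is not valid in the source encoding.
    new=$(printf '%s' "$old" | iconv -f "$from" -t UTF-8) || return 1
    # Pure-ASCII names come out unchanged: nothing to rename.
    if [ "$old" = "$new" ]; then return 0; fi
    # Refuse to overwrite an existing file.
    if [ -e "$dir/$new" ]; then return 1; fi
    mv -- "$dir/$old" "$dir/$new"
}
```

Unlike convmv, this sketch does not check whether a name is already UTF-8 and does not recurse, so convmv remains the safer tool for a whole directory tree.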
In Allen's original post referred to above, he makes the important point that if you naively extract an archive containing non-UTF-8 filenames onto a system with a UTF-8 locale, the gibberish filenames will automatically be re-encoded as UTF-8 but will still be unreadable. If this happens, convmv will refuse to convert the mangled filenames to UTF-8, because they are valid UTF-8 already!
For example,
[archjun@arch Downloads]$ 7z x 편혜영.zip
7-Zip [64] 9.20 Copyright (c) 1999-2010 Igor Pavlov 2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,2 CPUs)
Processing archive: 편혜영.zip
Extracting Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Extracting Korean Writers(2009)/ÆíÇý¿µ.hwp
Everything is Ok
Files: 2
Size: 33991
Compressed: 25578
[archjun@arch Downloads]$ convmv -f cp949 -t utf8 -r --notest ~/Downloads/"Korean Writers(2009)"/
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ(¿µ¹®).doc.docx
Skipping, already UTF-8: /home/archjun/Downloads/Korean Writers(2009)/ÆíÇý¿µ.hwp
Ready!
In the case above, we didn't specify a character encoding for the extracted filenames, so 7z defaulted to the character encoding of our locale, which is en_US.UTF-8.
Because of this, when we try to use convmv to convert from an MS Windows character encoding to UTF-8, convmv tells us that the filenames are already in UTF-8 and skips them!
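The mangling can be reproduced with iconv alone. The sketch below assumes the extractor treated each original byte as a Latin-1 character and wrote it out as UTF-8, which is consistent with the ÆíÇý¿µ output above but is my reading rather than something 7z documents here. The raw_name helper is made up for the illustration.

```shell
# Original CP949/EUC-KR bytes for 편혜영 (octal escapes, portable printf).
raw_name() { printf '\306\355\307\375\277\265'; }

# Reproduce the damage: read each byte as one Latin-1 character and
# re-encode that character as UTF-8.
mangled=$(raw_name | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$mangled"

# Under that assumption the damage is reversible in two steps: go back to
# the single bytes first, then decode them as EUC-KR.
restored=$(printf '%s' "$mangled" | iconv -f UTF-8 -t ISO-8859-1 | iconv -f EUC-KR -t UTF-8)
printf '%s\n' "$restored"
```

So a single cp949-to-utf8 pass cannot work on such names, since the bytes on disk are no longer CP949. Extracting with the right LANG in the first place avoids the detour.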
The same holds true for other archive extractors like unzip, file-roller, etc. So don't forget to prefix the archive extraction command with env to create a modified environment, and set LANG to the proper encoding (whether it is euc-jp, euc-kr, shift_jis, gbk, etc.) so that the extracted filenames keep their original character encoding. That is what makes the conversion with convmv possible!
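The env prefix changes LANG for that one command only. In this small sketch the sh -c child is just a stand-in for 7z or unzip:

```shell
# env applies the assignment to a single command; the calling shell keeps
# whatever LANG it already had.
child_lang=$(env LANG=C sh -c 'printf "%s" "$LANG"')
printf 'child saw LANG=%s, this shell still has LANG=%s\n' "$child_lang" "${LANG:-unset}"

# Caveat: POSIX gives LC_ALL precedence over LANG. If LC_ALL is exported in
# your session, setting LANG alone has no effect on the child's locale, and
# overriding LC_ALL for that command instead may be needed.
```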
Postscript 2014-12-21:
Once you have converted filenames from a Windows text encoding like euc-kr to UTF-8, you may also need to convert the text inside a plain text file (not a binary format like .doc, .hwp, etc.) that was created in a Windows environment.
The Linux command for converting the text within a file to another encoding is iconv. Let's assume we have a file, someText.txt, that was created in Windows and contains Korean characters encoded in euc-kr. To convert it to UTF-8, you can invoke iconv with the following flags:
iconv -c -f euc-kr -t utf8 someText.txt > someTextUTF-8.txt
-c Silently discard characters that cannot be converted instead of
terminating when encountering such characters.
-f from-encoding (input characters)
-t to-encoding (output characters)
The invocation above reads in someText.txt in euc-kr encoding and redirects output to someTextUTF-8.txt in UTF-8 encoding.
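Because -c drops unconvertible characters silently, a wrong guess at the source encoding can go unnoticed. One possible variation, sketched below, leaves -c out so that iconv fails instead. The helper name to_utf8 and its temporary-file convention are my own, not from any tool.

```shell
# Hypothetical helper: convert one text file to UTF-8 without -c, so that an
# unconvertible byte makes iconv fail instead of being dropped silently.
to_utf8() {
    src=$1
    dst=$2
    from=${3:-EUC-KR}
    # Write to a temporary file first so a failed run leaves no partial output.
    if iconv -f "$from" -t UTF-8 "$src" > "$dst.tmp"; then
        mv -- "$dst.tmp" "$dst"
    else
        rm -f -- "$dst.tmp"
        echo "to_utf8: could not convert $src from $from" >&2
        return 1
    fi
}
```

With the file from the example above, the call would be to_utf8 someText.txt someTextUTF-8.txt.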