Imagine that you have a 100-page .doc file with alternating lines of English and some foreign language. Manually cutting and pasting every foreign-language sentence into another file is not feasible! Luckily, Python 3 is UTF-8 friendly, so we can easily manipulate English and all kinds of other languages within Python 3 programs.
My script is called deleteLanguage.py and it is available on github at https://github.com/gojun077/deleteLanguage.
It will take mixed text like the following (I didn't do this horrible translation into English, by the way):
기업에서 왜 트리즈를 교육해야 하는가?
Why do the companies should educate TRIZ to their members?
오늘날 특히 기업에서의 연구개발은 문제를 해결하느냐 못하느냐의 문제가 아니다.
Today being able to solve the problems or not being isn’t a real problem in the corporation’s research and development.
얼마나 빨리 새로운 결과를 찾아내는 가에 따라 성공여부가 결정된다.
The success of them depends on how fast they can find the new solutions.
하지만 우리들은 문제를 더 빨리 혁신적으로 해결할 수 있는 방법을 공부한 적이 없다.
But we have never learned to solve problems faster and more innovative.
대부분의 많은 연구개발자들은 창의적인 문제해결이 무엇인지도 모른다.
Most researchers and engineers don’t even know what the creative method to solve the problems is.
오늘날도 많은 연구자들은 각자 기존의 경험과 지식을 바탕으로 열심히 생각하기를 한다.
Today they are thinking hard based on only their own experience and knowledge.
and parse it into separate English output:
Why do the companies should educate TRIZ to their members?
Today being able to solve the problems or not being a real problem in the research and development.
The success of them depends on how fast they can find the new solutions.
But we have never learned to solve problems faster and more innovative.
Most researchers and engineers even know what the creative method to solve the problems is.
Today they are thinking hard based on only their own experience and knowledge.
and non-English output:
기업에서 왜 트리즈를 교육해야 하는가?
오늘날 특히 기업에서의 연구개발은 문제를 해결하느냐 못하느냐의 문제가 아니다.
얼마나 빨리 새로운 결과를 찾아내는 가에 따라 성공여부가 결정된다.
하지만 우리들은 문제를 더 빨리 혁신적으로 해결할 수 있는 방법을 공부한 적이 없다.
대부분의 많은 연구개발자들은 창의적인 문제해결이 무엇인지도 모른다.
오늘날도 많은 연구자들은 각자 기존의 경험과 지식을 바탕으로 열심히 생각하기를 한다.
The version in the initial commit has the following limitations:
The script assumes that any character not included in string.printable (from the Python string module) is a non-English character, so the following strings
'Awesome☻'
'...noted.†'
would not be detected as 'English' by the script.
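In other words, the classification reduces to a per-character whitelist test. Here is a minimal sketch of that check, using the same WHITELIST the script builds from string.printable plus a few hand-added curly quotes (the function name looks_english is mine, not the script's):

```python
import string

# The script's whitelist: everything in string.printable plus a few
# curly quotation marks it allows by hand
WHITELIST = set(string.printable) | set('“”’')

def looks_english(word):
    """True only if every character of word is in the whitelist."""
    return all(ch in WHITELIST for ch in word)

print(looks_english('Awesome☻'))    # False: ☻ is outside the whitelist
print(looks_english('...noted.†'))  # False: † is outside the whitelist
print(looks_english('noted.'))      # True
```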
When a non-English sentence contains the occasional English word, the script simply omits that word entirely. Consider the following Korean sentence:
"철수씨는 IBM에 근무한다."
deleteLanguage.py as it is currently implemented will parse the above snippet into the following when it outputs the non-English only text file:
"철수씨는 근무한다."
The '에' particle adjoining IBM is deleted along with the English word, because the script drops any whitespace-delimited token that contains English characters.
I haven't yet thought up a sure-fire algorithm to avoid this problem; creating prescriptive rules for dozens of one-off cases doesn't seem to be the solution, either.
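The problem boils down to tokenizing on whitespace: the particle travels with its host word. A condensed sketch of the script's rmEnglish check (shortened to a single list comprehension) reproduces the loss:

```python
# Mirrors the script's rmEnglish logic: drop any whitespace-delimited
# token that contains at least one English letter
ENGLISH = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

def keep_non_english(words):
    return [w for w in words if not any(c in ENGLISH for c in w)]

print(" ".join(keep_non_english("철수씨는 IBM에 근무한다.".split())))
# 철수씨는 근무한다.
```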
# This Python3 script takes UTF-8 encoded text files as input and writes out
# two files: (1) only English text (2) only non-English text
import doctest

FILEIN = "/home/archjun/Downloads/EngKor.txt"
FILEOUTeng = "/home/archjun/Downloads/EngOnly.txt"
FILEOUTkor = "/home/archjun/Downloads/KorOnly.txt"
WHITELIST = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c“”’'
ENGLISH = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'


def load_file():
    """
    Open the input text file and append each non-empty line as a string
    to a list
    """
    input_list = []
    with open(FILEIN, 'r') as infile:
        for line in infile:
            line = line.strip('\n')
            if line:
                input_list.append(line)
    return input_list


def winnowLos(los):
    """
    ListOfString -> ListOf(ListOfString)

    Given los where each list element is a long string containing spaces
    and multiple words, decompose each long string into a sublist of words.
    Returns a List of sublists of strings

    >>> winnowLos(['시간분리는 물리모순을 시간의 측면에서 독립적으로 바라보는 것이다.'])
    [['시간분리는', '물리모순을', '시간의', '측면에서', '독립적으로', '바라보는', '것이다.']]
    >>> winnowLos(['Who are you?', "I'm yomama"])
    [['Who', 'are', 'you?'], ["I'm", 'yomama']]
    """
    lolos = []
    for line in los:
        lolos.append(line.split())
    return lolos


def rmNonASCII(lol):
    """
    ListOf(ListOfString) -> ListOf(ListOfString)

    Given lolos where each element is a list of individual words,
    (1) check each word in each sublist for non-ASCII chars
    (2) if non-ASCII chars are detected, delete the word
    (3) append ASCII words to a new sublist
    (4) append ASCII-only sublists to a new list
    (5) return ASCII-only ListOf(ListOfString)

    >>> rmNonASCII([['ABC가나다', 'EFghijk라마바사'], ['ABC', 'def']])
    [[], ['ABC', 'def']]
    >>> rmNonASCII([['가나다!', '라마바사.', 'wow'], ['Can', 'you', 'hear', 'me?']])
    [['wow'], ['Can', 'you', 'hear', 'me?']]
    >>> rmNonASCII([['', 'F^K', '행복']])
    [['', 'F^K']]
    >>> rmNonASCII([['Once', 'upon'], ['행복', 'a', '“time”']])
    [['Once', 'upon'], ['a', '“time”']]
    """
    asciiLol = []  # ListOf(ListOfString) containing only ASCII words
    for sublist in lol:
        cleanLine = []
        for word in sublist:
            allASCII = True
            for char in word:
                if char not in WHITELIST:
                    allASCII = False
                    break
            if allASCII:
                cleanLine.append(word)
        asciiLol.append(cleanLine)
    return asciiLol


def rmEnglish(lol):
    """
    ListOf(ListOfString) -> ListOf(ListOfString)

    Given lolos where each element is a list of individual words,
    (1) check each word in each sublist for English chars
    (2) if English chars are detected, delete the word
    (3) append non-English words to a new sublist
    (4) append non-English-only sublists to a new list
    (5) return non-English-only ListOf(ListOfString)

    >>> rmEnglish([['ABC가나다', 'EFghijk라마바사'], ['가나다', '라마바']])
    [[], ['가나다', '라마바']]
    >>> rmEnglish([['가나다!', '라마바사.', 'wow'], ['Can', 'you', 'hear', 'me?']])
    [['가나다!', '라마바사.'], []]
    >>> rmEnglish([['', 'F^K', '행복']])
    [['', '행복']]
    >>> rmEnglish([['Once', 'upon'], ['행복한', '“인생”']])
    [[], ['행복한', '“인생”']]
    """
    nonEngLol = []  # ListOf(ListOfString) containing only non-English words
    for sublist in lol:
        cleanLine = []
        for word in sublist:
            noEnglish = True
            for char in word:
                if char in ENGLISH:
                    noEnglish = False
                    break
            if noEnglish:
                cleanLine.append(word)
        nonEngLol.append(cleanLine)
    return nonEngLol


## MAIN PROGRAM ##
doctest.testmod()
incoming = load_file()
process1 = winnowLos(incoming)
loASCII = rmNonASCII(process1)
loNonEnglish = rmEnglish(process1)

# write only ASCII text to FILEOUTeng
with open(FILEOUTeng, 'w') as outputF:
    for line in loASCII:
        outputF.write(" ".join(line) + '\n')

# write only non-ASCII text to FILEOUTkor
with open(FILEOUTkor, 'w') as outputF:
    for line in loNonEnglish:
        outputF.write(" ".join(line) + '\n')