Checking whether each line of a file contains non-Chinese characters (Python)

xiaoxiao 2021-03-25

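Sometimes, when doing natural language processing, we need to remove non-Chinese characters and keep only the Chinese characters in a text. The script below works at the line level: it keeps only the lines made up entirely of Chinese characters (U+4E00 to U+9FA5) and drops every line that contains anything else.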

#coding:utf-8
import re
import sys

# Matches any single character outside the range U+4E00 to U+9FA5.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5]')


def load_file(file_name):
    # Read every line, stripping surrounding whitespace; undecodable bytes are ignored.
    with open(file_name, encoding='utf-8', errors='ignore') as f:
        return [x.strip() for x in f]


def is_Contain_non_Chinese(string):
    # True if the string contains at least one non-Chinese character.
    return NON_CHINESE.search(string) is not None


def get_clean_corpus(r_file, d_file):
    # Write only the lines that consist entirely of Chinese characters.
    f_list = load_file(r_file)
    with open(d_file, 'w', encoding='utf-8') as f:
        for line in f_list:
            if not is_Contain_non_Chinese(line):
                f.write(line + '\n')


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("ERROR *************")
        print("Usage:{0} <input_file> <output_file>".format(sys.argv[0]))
        sys.exit(1)  # stop here instead of failing on sys.argv[1] below
    INPUT_FILE = sys.argv[1]
    OUTPUT_FILE = sys.argv[2]
    get_clean_corpus(INPUT_FILE, OUTPUT_FILE)

INPUT_FILE is the text file to process, which may contain non-Chinese characters.

OUTPUT_FILE is the processed text file, containing only the lines made up entirely of Chinese characters.
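To see which lines would survive the filter, here is a small sketch in Python 3 that applies the same character class to a few strings. The sample strings are made up for illustration, and `has_non_chinese` is a hypothetical helper name, not part of the script.

```python
import re

# Same character class as the script: anything outside U+4E00 to U+9FA5.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5]')


def has_non_chinese(text):
    """Return True if text contains at least one character outside the range."""
    return NON_CHINESE.search(text) is not None


# Made-up sample lines, for illustration only.
samples = ["自然语言处理", "自然语言处理NLP", "你好，世界", "2021年"]
results = {s: has_non_chinese(s) for s in samples}
for s, flag in results.items():
    print(s, flag)
```

Only the first sample passes. Full-width punctuation such as "，" also lies outside the range, so a line containing it is treated as non-Chinese and left out of the output file.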

To run the script: python test.py input_file output_file
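The script drops whole lines. If the goal is instead to keep every line and delete only the non-Chinese characters inside it, a variant along the following lines might work. This is a minimal sketch assuming Python 3; `strip_non_chinese` is a hypothetical helper name and the sample string is made up.

```python
import re

# One or more consecutive characters outside U+4E00 to U+9FA5.
NON_CHINESE_RUN = re.compile(r'[^\u4e00-\u9fa5]+')


def strip_non_chinese(text):
    """Remove every run of characters outside the range, keeping the rest."""
    return NON_CHINESE_RUN.sub('', text)


# Made-up sample line, for illustration only.
cleaned = strip_non_chinese("自然语言处理(NLP)很有趣!")
print(cleaned)  # 自然语言处理很有趣
```

Whether to drop lines or strip characters depends on the corpus: stripping keeps more text, but it can join words that were separated by punctuation or Latin text.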

When reposting, please cite the original URL: https://ju.6miu.com/read-532.html
