yangxin/SubCharTokenization

convert_to_random_index.py 1.69 KB
NoviScl committed on 2021-12-22 21:11 . push
from io import open
import pickle
import string
import random
random.seed(2021)
wubi2ch = "data/wubi_to_chinese.pkl"
ch2wubi = "data/chinese_to_wubi.pkl"
pinyin2ch = "data/pinyin_to_chinese.pkl"
ch2pinyin = "data/chinese_to_pinyin.pkl"
def load_dict(dict_path):
    # Load a pickled dictionary and close the file handle afterwards.
    with open(dict_path, "rb") as f:
        return pickle.load(f)
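# Collect every character known to either dictionary, plus ASCII punctuation.
ch_chars = list(load_dict(ch2wubi).keys()) + list(load_dict(ch2pinyin).keys()) + list(string.punctuation)
# Note: iteration order of a set of strings can differ between interpreter runs
# unless PYTHONHASHSEED is fixed, so the seeded shuffle alone does not make the order reproducible.
ch_chars = list(set(ch_chars))
random.shuffle(ch_chars)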
# print (len(ch_chars))
# Separator written after every character of the output (code point 95 + 50000 = U+C3AF).
SEP = chr(ord('_')+50000)
# random_index_map = {}
# for i in range(len(ch_chars)):
#     random_index_map[ch_chars[i]] = i + 10000
# print (list(random_index_map.keys())[:10])
# print (list(random_index_map.values())[:10])
# with open("random_index_map.pkl", 'wb') as f:
#     pickle.dump(random_index_map, f)
with open("random_index_map.pkl", 'rb') as f:
    random_index_map = pickle.load(f)
# print (list(random_index_map.keys())[:10])
# print (list(random_index_map.values())[:10])
with open('/data2/private/clsi/wubi_corpus_orig/formatted/baidubaike_corpus.txt', 'r') as f:
    with open('/data2/private/clsi/wubi_corpus_random_index/formatted/baidubaike_corpus.txt', 'w+') as fw:
        line = f.readline()
        idx = 0
        while line:
            idx += 1
            newline = ''
            for c in line.strip():
                # Replace a mapped character with its index; keep anything else unchanged.
                if c in random_index_map:
                    newline += str(random_index_map[c])
                else:
                    newline += c
                # Every character, mapped or not, is followed by the separator.
                newline += SEP
            newline += '\n'
            fw.write(newline)
            line = f.readline()
            # Progress report: print the line count and the latest converted line.
            if idx % 400000 == 0:
                print(idx)
                print(newline)
## tmux 10
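
The script above does two things: it builds a map from characters to integer indices (the commented-out part), and it converts the corpus line by line. A minimal sketch of both steps as small functions may make the logic easier to try on toy input. The names `build_random_index_map` and `convert_line` are hypothetical and do not appear in the repository. The sketch also sorts the characters before shuffling, which is an assumption made here for reproducibility and differs from the original `list(set(...))` ordering, so it will not reproduce the stored `random_index_map.pkl`.

```python
import random

# Same separator as the script: the code point of '_' shifted by 50000.
SEP = chr(ord('_') + 50000)


def build_random_index_map(chars, offset=10000, seed=2021):
    """Give each distinct character a shuffled integer index starting at `offset`.

    Hypothetical helper: sorting before the shuffle is an assumption made here
    so that the result depends only on the seed, not on set iteration order.
    """
    unique_chars = sorted(set(chars))
    rng = random.Random(seed)
    rng.shuffle(unique_chars)
    return {c: i + offset for i, c in enumerate(unique_chars)}


def convert_line(line, index_map, sep=SEP):
    """Replace mapped characters with their index and append `sep` after every character."""
    parts = []
    for c in line.strip():
        # Unmapped characters are passed through unchanged.
        parts.append(str(index_map.get(c, c)))
        parts.append(sep)
    return "".join(parts) + "\n"


if __name__ == "__main__":
    toy_map = build_random_index_map("abc")
    print(convert_line("ab z\n", toy_map))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation on long lines; the output format is the same as in the script.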