
yangxin/SubCharTokenization

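This repository does not declare an open-source license file (LICENSE); before use, check the project description and the upstream dependencies of its code.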
convert_to_byte.py 1.12 KB
NoviScl committed on 2021-12-22 21:11 . push
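from io import open
import pickle

# Paths to the Wubi <-> Chinese lookup tables (not used in this script).
wubi2ch = "data/wubi_to_chinese.pkl"
ch2wubi = "data/chinese_to_wubi.pkl"


def load_dict(dict_path):
    # Use a context manager so the file handle is closed after loading.
    with open(dict_path, "rb") as f:
        return pickle.load(f)


# Lookup from a byte value to its replacement character.
with open("byte_char_map.pkl", "rb") as f:
    byte_char_map = pickle.load(f)

# Separator appended after every source character (code point U+C3AF).
SEP = chr(ord('_')+50000)

# The encoding is set explicitly so the result does not depend on the platform default.
with open('/data2/private/clsi/wubi_corpus_orig/formatted/baidubaike_corpus.txt', 'r', encoding='utf-8') as f:
    with open('/data2/private/clsi/wubi_corpus_byte/formatted/baidubaike_corpus.txt', 'w+', encoding='utf-8') as fw:
        line = f.readline()
        idx = 0
        while line:
            idx += 1
            newline = ''
            for c in line.strip():
                # Replace each UTF-8 byte of the character with its mapped character.
                c = bytes(c, 'utf-8')
                for byte_index in c:
                    ch = byte_char_map[byte_index]
                    newline += ch
                newline += SEP
            newline += '\n'
            fw.write(newline)
            line = f.readline()
            # Progress report: print the line count and the latest converted line.
            if idx % 400000 == 0:
                print(idx)
                print(newline)

# with open('/data2/private/clsi/wubi_corpus_byte/formatted/baidubaike_corpus.txt', 'r') as f:
#     print(f.readline())

## tmux 12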
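
The per-line conversion above can be tried without the corpus or the pickle file. The following is a minimal sketch, not the author's code: `BYTE_CHAR_MAP` is a hypothetical stand-in for the contents of byte_char_map.pkl (which are not shown here), and `encode_line` is an illustrative helper name. It assumes the real map is likewise indexed by byte value.

```python
# Hypothetical stand-in for byte_char_map.pkl: the real mapping is not shown
# in the script, so each byte value 0..255 is mapped to a private-use character.
BYTE_CHAR_MAP = {b: chr(0xE000 + b) for b in range(256)}

# Same separator as in the script: appended after every source character.
SEP = chr(ord('_') + 50000)


def encode_line(line, byte_char_map=BYTE_CHAR_MAP, sep=SEP):
    """Convert one text line into its byte-level character sequence."""
    pieces = []
    for c in line.strip():
        # One mapped character per UTF-8 byte of the source character.
        pieces.append(''.join(byte_char_map[b] for b in c.encode('utf-8')))
        # Mark the end of the source character.
        pieces.append(sep)
    return ''.join(pieces) + '\n'


encoded = encode_line("中a")
# "中" is 3 bytes in UTF-8 and "a" is 1 byte, so the result holds
# 3 + 1 mapped characters, 2 separators, and the trailing newline.
print(len(encoded))  # 7
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which may matter for a corpus of this size.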