qinyukun/named_entity_recognition

data.py 1.10 KB
luopeixiang committed on 2019-04-09 18:57: fix UnicodeEncodeError
from os.path import join
from codecs import open


def build_corpus(split, make_vocab=True, data_dir="./ResumeNER"):
    """Read one data split and return its sentences and tag sequences."""
    assert split in ['train', 'dev', 'test']

    word_lists = []
    tag_lists = []
    with open(join(data_dir, split+".char.bmes"), 'r', encoding='utf-8') as f:
        word_list = []
        tag_list = []
        for line in f:
            if line.strip():
                # Each non-blank line holds one character and its tag.
                word, tag = line.split()
                word_list.append(word)
                tag_list.append(tag)
            elif word_list:
                # A blank line ends the current sentence.
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []
        # Keep the last sentence if the file does not end with a blank line.
        if word_list:
            word_lists.append(word_list)
            tag_lists.append(tag_list)

    # If make_vocab is True, also return word2id and tag2id.
    if make_vocab:
        word2id = build_map(word_lists)
        tag2id = build_map(tag_lists)
        return word_lists, tag_lists, word2id, tag2id
    else:
        return word_lists, tag_lists


def build_map(lists):
    """Map each distinct element to an integer id, in order of first appearance."""
    maps = {}
    for list_ in lists:
        for e in list_:
            if e not in maps:
                maps[e] = len(maps)
    return maps
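
The following is a minimal, self-contained sketch of the input layout this module appears to expect and of the mappings it produces. It is an illustration under stated assumptions, not part of the repository: the sample sentences and the helper name `read_bmes` are hypothetical, and the example writes its own temporary `train.char.bmes` file rather than reading the real ResumeNER data. `build_map` is copied from above so the snippet runs on its own.

```python
import os
import tempfile

# Hypothetical sample in the two-column "character tag" layout,
# with a blank line between sentences. Not taken from ResumeNER.
SAMPLE = "高 B-NAME\n勇 E-NAME\n\n中 B-ORG\n国 E-ORG\n\n"


def read_bmes(path):
    """Parse blank-line-separated sentences of 'character tag' lines."""
    word_lists, tag_lists = [], []
    words, tags = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                word, tag = line.split()
                words.append(word)
                tags.append(tag)
            elif words:
                word_lists.append(words)
                tag_lists.append(tags)
                words, tags = [], []
    # Keep a trailing sentence that is not followed by a blank line.
    if words:
        word_lists.append(words)
        tag_lists.append(tags)
    return word_lists, tag_lists


def build_map(lists):
    """Assign ids in order of first appearance (same logic as data.py)."""
    maps = {}
    for list_ in lists:
        for e in list_:
            if e not in maps:
                maps[e] = len(maps)
    return maps


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.char.bmes")
    with open(path, "w", encoding="utf-8") as f:
        f.write(SAMPLE)
    word_lists, tag_lists = read_bmes(path)

word2id = build_map(word_lists)
tag2id = build_map(tag_lists)
print(word_lists)
print(tag2id)
```

Because ids are assigned in order of first appearance, the mapping depends on the order of the sentences in the file; building the vocabulary from the training split only keeps ids stable across runs.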