
魏泽桦/patent_system

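This repository does not declare an open-source license file (LICENSE); before using it, check the project description and the upstream dependencies of its code.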
patent_LDA.py 2.35 KB
aimerin committed on 2021-04-26 16:47 . 2021-04-26 Simple implementation of the LDA model
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from pprint import pprint
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

### Train an LDA model with sklearn
# Load the stop-word list; this script splits the file content on '","'
with open('./dataset/hit_stopwords.txt', 'r', encoding="utf-8") as f:
    line = f.read()
line = line.split('","')
stopwords = [l.strip() for l in line]
# Read three segmented patent documents, joining their lines with spaces
# (res1, res2 and res3 are not used further in this script)
with open('./segment/segment_CN102804030A.txt', 'r', encoding='utf-8') as f:
    res1 = ' '.join(f.read().split('\n'))
with open('./segment/segment_CN106607772A.txt', 'r', encoding='utf-8') as f:
    res2 = ' '.join(f.read().split('\n'))
with open('./segment/segment_CN111024533A.txt', 'r', encoding='utf-8') as f:
    res3 = ' '.join(f.read().split('\n'))

def load_data():
    # Each line of the segmented file is treated as one document
    with open('./segment/segment_03_26.txt', 'r', encoding='utf-8') as f:
        sentence_list = f.readlines()
    return sentence_list

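if __name__ == '__main__':
    corpus = load_data()
    # Bag-of-words counts with the stop words removed
    cntVector = CountVectorizer(stop_words=stopwords)
    cntTf = cntVector.fit_transform(corpus)
    # n_topics was renamed to n_components in scikit-learn 0.19
    lda = LatentDirichletAllocation(n_components=4, max_iter=5,
                                    learning_method='online',
                                    learning_offset=50,
                                    random_state=100)
    # docres is the document-topic matrix, one row per document
    docres = lda.fit_transform(cntTf)
    # get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
    vocab = cntVector.get_feature_names_out()
    n_top_words = 5
    topic_words = {}
    pprint(docres)
    pprint(lda.components_)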
    for topic, comp in enumerate(lda.components_):
        # np.argsort(comp) returns the indices that would sort comp in
        # ascending order: element i of the result is the index in comp of
        # the value that belongs at position i of the sorted array.
        # ex. arr = [3,7,1,0,3,6]
        #     np.argsort(arr) -> [3, 2, 0, 4, 5, 1]
        # Reversing the result gives descending order, so word_idx holds the
        # indices in comp of the n_top_words words most relevant to the topic.
        word_idx = np.argsort(comp)[::-1][:n_top_words]
        # store the words most relevant to the topic
        topic_words[topic] = [vocab[i] for i in word_idx]
    print(topic_words)
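
The top-word extraction in the loop above can be tried in isolation, without the dataset files. The following is a minimal sketch, assuming a hand-made 2x5 topic-word matrix and a made-up vocabulary in place of `lda.components_` and the vectorizer's feature names. The helper name `top_words_per_topic` and the toy values are hypothetical and not part of the repository.

```python
import numpy as np


def top_words_per_topic(components, vocab, n_top_words):
    """Map each topic index to its n_top_words highest-weighted words."""
    topic_words = {}
    for topic, comp in enumerate(components):
        # argsort is ascending, so reverse it and keep the first n_top_words
        word_idx = np.argsort(comp)[::-1][:n_top_words]
        topic_words[topic] = [vocab[i] for i in word_idx]
    return topic_words


# Toy stand-ins (assumptions for illustration, not real model output)
toy_components = np.array([
    [0.1, 2.0, 0.5, 3.0, 0.2],
    [4.0, 0.3, 1.5, 0.1, 0.9],
])
toy_vocab = ["sensor", "display", "panel", "liquid", "crystal"]

result = top_words_per_topic(toy_components, toy_vocab, n_top_words=2)
print(result)
```

With a fitted model, the same helper could be called with `lda.components_` and `vocab` from the script; the returned dictionary has the same shape as `topic_words`.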