enthusiasmForever/贝叶斯新闻分类 (Bayesian News Classification)

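This repository does not declare an open-source license file (LICENSE); before using it, check the project description and the upstream dependencies of its code.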
parseurl.py 1.46 KB
贰拾贰画生 committed on 2015-12-29 20:26 . origin
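#!/usr/bin/python
# -*- coding:utf-8 -*-
import re
import urllib2
from bs4 import BeautifulSoup


def getTextFromUrl(url, index):
    # Indices 1-6 are handled by getText1_6, index 7 by getText7.
    # Any other index falls through and returns None.
    if index > 0 and index < 7:
        return getText1_6(url)
    elif index == 7:
        return getText7(url)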


def getText1_6(url):
    # Fetch the raw page (Python 2: read() returns a byte string).
    response = urllib2.urlopen(url)
    html = response.read()
    # Article paragraphs are <P> tags with a 2em text indent.
    pattern = re.compile(r'<P style="TEXT-INDENT: 2em">(.*?)</P>')
    articlePs = re.findall(pattern, html)
    # Matches any inline tag left inside a paragraph.
    enPattern = re.compile(r'<.*?>')
    text = ''
    for articleP in articlePs:
        # Decode the paragraph as GBK and re-encode it as UTF-8.
        articleP = articleP.decode('gbk').encode('utf-8')
        # Strip every inline tag in one pass.
        text = text + enPattern.sub('', articleP)
    return text


def getText7(url):
    response = urllib2.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html)
    # Serialize the article container(s) so the regex below can scan them.
    text_ = str(soup.find_all('div', id='artical_real'))
    pattern = re.compile(r'<p>(.*?)</p>')
    ps = re.findall(pattern, text_)
    # Matches any inline tag left inside a paragraph.
    enPattern = re.compile(r'<.*?>')
    text = ''
    # Text is recorded only when at least one <p> paragraph was found.
    for p in ps:
        # Remove every <.*?> match from the paragraph.
        text = text + enPattern.sub('', p)
    return text
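
The paragraph extraction in getText1_6 can be tried without a network connection. The following is a minimal Python 3 sketch, not the author's code: the helper name `extract_paragraph_text`, the `errors='replace'` choice, and the sample HTML are all assumptions made for illustration. It reuses the two regular expressions from the script, but decodes the whole page once and strips inline tags with `re.sub`.

```python
import re

# Pattern copied from getText1_6: paragraphs marked with a 2em text indent.
PARAGRAPH_RE = re.compile(r'<P style="TEXT-INDENT: 2em">(.*?)</P>')
# Same tag pattern as enPattern in the script.
TAG_RE = re.compile(r'<.*?>')


def extract_paragraph_text(html_bytes, encoding='gbk'):
    """Hypothetical helper: concatenate the text of all matching paragraphs."""
    # Assumption: undecodable bytes are replaced rather than raising an error.
    html = html_bytes.decode(encoding, errors='replace')
    paragraphs = PARAGRAPH_RE.findall(html)
    # re.sub removes every inline tag from a paragraph in one pass.
    return ''.join(TAG_RE.sub('', p) for p in paragraphs)


# Made-up sample page, encoded as GBK like the pages the script expects.
sample = ('<P style="TEXT-INDENT: 2em">新闻<B>分类</B></P>'
          '<P style="TEXT-INDENT: 2em">Bayes</P>').encode('gbk')
print(extract_paragraph_text(sample))
```

Because the helper takes bytes rather than a URL, the fetching step (`urllib2.urlopen` in the script) stays separate, which makes the parsing logic easy to check against saved pages.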