
Python Web Scraping in Practice: Analyzing Douban Reviews of "Wolf Warrior 2"

程序员大咖 · WeChat official account · Programmers · 2018-04-23 10:24


urlopen(requrl)
html_data = resp.read().decode('utf-8')
soup = bs(html_data, 'html.parser')
comment_div_lits = soup.find_all('div', class_='comment')

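At this point, comment_div_lits holds the HTML of every div tag whose class attribute is comment. The screenshot above also shows that each viewer's review of the film is stored inside a p tag, as shown in the figure below:

So we parse the HTML in comment_div_lits further, with the following code: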

eachCommentList = []
for item in comment_div_lits:
    # Take the first <p> of each comment block; skip blocks without a plain-text review
    p = item.find('p')
    if p is not None and p.string is not None:
        eachCommentList.append(p.string)

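Printing the list with print(eachCommentList) shows that it holds the reviews we want, as shown in the figure below:

We have now scraped the review data for a film currently showing on Douban. Next we clean the data and display it as a word cloud.

2. Data Cleaning

To make cleaning easier, we concatenate the items in the list into a single string, with the following code: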

# Concatenate every review, stripped of surrounding whitespace, into one string
comments = ''.join(str(comment).strip() for comment in eachCommentList)

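Check the result with print(comments), as shown in the figure below:

All the reviews are now a single string, but it still contains a lot of punctuation and similar symbols. These contribute nothing to word-frequency statistics, so we strip them out with a regular expression. In Python, regular expressions are provided by the re module. The code is as follows: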

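import re

# Keep only runs of Chinese characters (U+4E00 to U+9FA5); punctuation, digits and Latin text are dropped
pattern = re.compile(r'[\u4e00-\u9fa5]+')
filterdata = re.findall(pattern, comments)
cleaned_comments = ''.join(filterdata)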

Check the result again with print(cleaned_comments), as shown in the figure below:

The punctuation is now gone from the review data, which looks much "cleaner".
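To make the effect of this cleaning step concrete, here is a minimal sketch run on a made-up string. The sample text and the names sample, han_only, runs and cleaned are assumptions for illustration, not part of the scraped data. It also shows a side effect worth knowing: joining the matches with an empty string glues neighbouring phrases together, so the original sentence boundaries are lost.

```python
import re

# Hypothetical sample standing in for the scraped `comments` string.
sample = "很好看！！！Wolf Warrior 2，燃爆了~ 10/10"

# Match runs of characters in the range U+4E00 to U+9FA5 (common Chinese characters).
han_only = re.compile(r'[\u4e00-\u9fa5]+')

runs = han_only.findall(sample)   # one list entry per uninterrupted run
cleaned = ''.join(runs)           # runs glued together, boundaries lost

print(runs)     # ['很好看', '燃爆了']
print(cleaned)  # 很好看燃爆了
```

If sentence boundaries matter for a later step, joining with a space instead of an empty string would be one way to keep them.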

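Because we want word-frequency statistics, we first have to segment the Chinese text into words. Here I use jieba. If jieba is not installed, run pip install jieba in a console. (Note: pip list shows whether these libraries are installed.) The code is as follows: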

import jieba  # Chinese word segmentation package
import pandas as pd

segment = jieba.lcut(cleaned_comments)
words_df = pd.DataFrame({'segment': segment})

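We load pandas here to hold the segmentation result in a DataFrame (jieba itself does not require pandas). Use words_df.head() to inspect the segmented words, as shown in the figure below:

The figure above shows that our data contains function words (stopwords) such as "看", "太" and "的". These words are frequent in any context and carry no real meaning, so we remove them.

I keep the stopwords in a file named stopwords.txt and compare our data against it. (Note: searching Baidu for stopwords.txt will turn up a copy of the file to download.) The stopword-removal code is as follows: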

# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]
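The pandas filter keeps only the rows whose segment does not appear in the stopword column. The same idea can be sketched with the standard library alone. The token list and stopword set below are made up for illustration, and the frequency count is only a rough stand-in for the statistics step this cleaning prepares for, not the article's own code.

```python
from collections import Counter

# Hypothetical segmentation output and stopword set, for illustration only.
tokens = ["战狼", "的", "动作", "太", "燃", "的", "动作", "看"]
stopwords = {"的", "太", "看"}

# Keep every token that is not a stopword (the plain-Python analogue of ~isin).
kept = [t for t in tokens if t not in stopwords]

# Count how often each remaining word occurs.
freq = Counter(kept)

print(kept)                 # ['战狼', '动作', '燃', '动作']
print(freq.most_common(1))  # [('动作', 2)]
```

A set makes each membership check constant time on average, which matters once the stopword list grows to a few thousand entries.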

Continue using words_df .