
Hands-On | Scraping Data on 100 Movies: A Beginner's Guide to Web Scraping in R

大数据文摘 (Big Data Digest) · WeChat official account · Big Data · 2017-04-25 21:33



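You can now clear the selector section and select all the titles. Check visually that every title is selected, and use your cursor to add or remove selections as needed. I have done the same here.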

Step 6: Once again, I have the CSS selector for the titles, which is .lister-item-header a, and I will use it with the following code to scrape all the titles.

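#Use the CSS selector to scrape the title section
#(webpage is assumed to be the parsed page object created in an earlier step)

title_data_html <- html_nodes(webpage, '.lister-item-header a')

#Convert the title data to text

title_data <- html_text(title_data_html)

#Let's have a look at the titles

head(title_data)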

[1] "Sing"          "Moana"         "Moonlight"     "Hacksaw Ridge"

[5] "Passengers"    "Trolls"
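
To try the selector without fetching the live page, here is a minimal sketch that runs the same two calls against a small inline HTML fragment. The fragment, its markup, and the names sample_page, title_nodes, and titles are assumptions made up for illustration; only the selector .lister-item-header a comes from the step above.

```r
library(rvest)

# Hypothetical stand-in for the real listing page (assumed markup, not the site's actual HTML)
sample_page <- read_html('
  <div class="lister-item">
    <h3 class="lister-item-header"><span>1.</span> <a href="#">Sing</a></h3>
  </div>
  <div class="lister-item">
    <h3 class="lister-item-header"><span>2.</span> <a href="#">Moana</a></h3>
  </div>')

# Select every <a> nested inside an element with class lister-item-header
title_nodes <- html_nodes(sample_page, ".lister-item-header a")

# Extract the text content of each matched node
titles <- html_text(title_nodes)
print(titles)
```

Because the selector targets only the link inside the header, the rank number in the neighbouring span is not picked up.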


Step 7: In the code below, I do the same for the Description, Runtime, Genre, Rating, Metascore, Votes, Gross_Earning_in_Mil, Director, and Actor data.

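#Use the CSS selector to scrape the description section

description_data_html

#Convert the description data to text

description_data <- html_text(description_data_html)

#Let's have a look at the description data

head(description_data)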

[1] "\nIn a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "\nIn Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "\nA chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "\nWWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "\nA spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "\nAfter the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."

#Data preprocessing: remove '\n'

description_data <- gsub("\n", "", description_data)

#Let's have another look at the description data

head(description_data)

[1] "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing competition becomes grander than he anticipates even as its finalists' find that their lives will never be the same."

[2] "In Ancient Polynesia, when a terrible curse incurred by the Demigod Maui reaches an impetuous Chieftain's daughter's island, she answers the Ocean's call to seek out the Demigod to set things right."

[3] "A chronicle of the childhood, adolescence and burgeoning adulthood of a young, African-American, gay man growing up in a rough neighborhood of Miami."

[4] "WWII American Army Medic Desmond T. Doss, who served during the Battle of Okinawa, refuses to kill people, and becomes the first man in American history to receive the Medal of Honor without firing a shot."

[5] "A spacecraft traveling to a distant colony planet and transporting thousands of people has a malfunction in its sleep chambers. As a result, two passengers are awakened 90 years early."

[6] "After the Bergens invade Troll Village, Poppy, the happiest Troll ever born, and the curmudgeonly Branch set off on a journey to rescue her friends."
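
The newline cleanup can be checked on its own with base R. This is a minimal sketch on two made-up strings shaped like the raw descriptions; the names raw_descriptions and clean_descriptions are hypothetical.

```r
# Two sample strings shaped like the raw scraped descriptions (made up for illustration)
raw_descriptions <- c("\nA short plot summary.", "\nAnother plot summary.")

# Replace every newline character with an empty string
clean_descriptions <- gsub("\n", "", raw_descriptions)
print(clean_descriptions)
```

Note that gsub removes every newline in the string, including any in the middle of the text. If only leading and trailing whitespace should go, base R's trimws is an alternative.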

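#Use the CSS selector to scrape the movie runtime section

runtime_data_html

#Convert the runtime data to text

runtime_data <- html_text(runtime_data_html)

#Let's have a look at the runtime

head(runtime_data)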

[1] "108 min" "107 min" "111 min" "139 min" "116 min" "92 min"

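#Data preprocessing: remove ' min' and convert to numeric

runtime_data <- gsub(" min", "", runtime_data)

runtime_data <- as.numeric(runtime_data)

#Let's have another look at the runtime data

head(runtime_data)

[1] 108 107 111 139 116  92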
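
As a quick check of the runtime conversion, the sketch below applies the same idea to the six raw values printed earlier. The names raw_runtime and runtime_minutes are hypothetical, and the NA check is an extra safeguard I am assuming is useful, not part of the original steps.

```r
# Sample values copied from the raw runtime output shown earlier
raw_runtime <- c("108 min", "107 min", "111 min", "139 min", "116 min", "92 min")

# Strip the " min" suffix, then convert the remaining digits to numbers
runtime_minutes <- as.numeric(gsub(" min", "", raw_runtime))

# as.numeric returns NA for anything that is not a clean number, so check for that
stopifnot(!anyNA(runtime_minutes))
print(runtime_minutes)
```

Checking for NA right after the conversion catches entries with an unexpected format before they reach any later analysis.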

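#Use the CSS selector to scrape the movie genre section

genre_data_html

#Convert the genre data to text

genre_data <- html_text(genre_data_html)

#Let's have a look at the genres

head(genre_data)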

[1] "\nAnimation, Comedy, Family "

[2] "\nAnimation, Adventure, Comedy "

[3] "\nDrama "

[4] "\nBiography, Drama, History "

[5] "\nAdventure, Drama, Romance "

[6] "\nAnimation, Adventure, Comedy "

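#Data preprocessing: remove '\n'

genre_data <- gsub("\n", "", genre_data)

#Data preprocessing: remove excess spaces

genre_data <- gsub(" ", "", genre_data)

#Keep only the first genre of each movie

genre_data <- gsub(",.*", "", genre_data)

#Convert each genre from text to a factor

genre_data <- as.factor(genre_data)

#Let's have another look at the genre data

head(genre_data)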

[1] Animation Animation Drama     Biography Adventure Animation
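
The genre cleanup can also be written more compactly. The sketch below is a variant, not the original code: it uses trimws to strip the surrounding whitespace and sub to keep the text before the first comma, applied to four of the raw values printed earlier. The names raw_genre, first_genre, and genre_factor are hypothetical.

```r
# Sample values copied from the raw genre output shown earlier
raw_genre <- c("\nAnimation, Comedy, Family ", "\nAnimation, Adventure, Comedy ",
               "\nDrama ", "\nBiography, Drama, History ")

# trimws drops the leading newline and trailing space;
# sub then deletes everything from the first comma onward
first_genre <- sub(",.*", "", trimws(raw_genre))

# Store the result as a categorical variable
genre_factor <- as.factor(first_genre)
print(genre_factor)
```

Unlike removing every space, trimming only the ends would preserve a multi-word genre name if one appeared first in the list.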






