Hands-On Guide to Writing a Web Crawler (4): Getting Started with Scrapy


May 22, 2018, 数盟

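In this series:

Hands-On Guide to Writing a Web Crawler (1): NetEase Cloud Music Playlists

Hands-On Guide to Writing a Web Crawler (2): A Mini Crawler Architecture

Hands-On Guide to Writing a Web Crawler (3): Comparing Open-Source Crawler Frameworks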

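In the last installment we took a rational look at why Scrapy is worth learning. There was really only one reason: it's free, and it won't cost you a cent!

Hey, who's throwing tomatoes? Fine, I admit I've been watching too much TV. No TV tonight, though: to meet the deadline, it's going to be another sleepless night... Back to business. This installment covers the basics of Scrapy. If you want to dig deeper, see the official documentation, which is famously thorough, so I won't use up space here repeating it.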

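Architecture overview

Below is Scrapy's architecture: its components and an overview of the data flow through the system (shown by the red arrows). Each component is then introduced briefly, along with a short description of the data flow.

That's the architecture. The flow is much like the mini architecture I introduced in part 2, but it is far more extensible.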
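To make the data flow a bit more concrete, here is a toy sketch in plain Python of the loop that Scrapy's engine, scheduler, downloader, spider and item pipeline form. It is not Scrapy's code, and everything in it is an assumption made for illustration: the fake "web" is a dictionary, the URLs are made up, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> "quote text|next page URL".
FAKE_WEB = {
    "http://example.test/page/1/": "quote A|http://example.test/page/2/",
    "http://example.test/page/2/": "quote B|",
}


def download(url):
    """Downloader: fetch a page (here just a dictionary lookup)."""
    return FAKE_WEB[url]


def parse(body):
    """Spider: yield one item, then a follow-up URL if the page names one."""
    text, next_url = body.split("|")
    yield {"text": text}
    if next_url:
        yield next_url


def run(start_urls):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    stored = []                    # Item pipeline: here it only collects items
    while scheduler:
        url = scheduler.popleft()
        body = download(url)
        for result in parse(body):
            if isinstance(result, dict):
                stored.append(result)      # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)   # new requests go back to the scheduler
    return stored


items = run(["http://example.test/page/1/"])
print(items)
```

The real framework does the same round trip asynchronously and lets you plug middleware in between each pair of components, which is where the extensibility comes from.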

One more thing

scrapy startproject tutorial

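This command creates a tutorial directory with the following contents:

tutorial/
    scrapy.cfg            # project configuration file
    tutorial/             # the project's Python module; you will add your code here
        __init__.py
        items.py          # project items file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory where spider code goes
            __init__.py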

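Writing your first spider

A Spider is a class you write to scrape data from a single website (or a group of websites). It contains the initial URLs to download, along with methods that define how to follow links in the pages and how to parse page content.

Below is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory: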

import scrapy


class QuotesSpider(scrapy.Spider):
    # Name used on the command line: scrapy crawl quotes
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            # Each response is handed to self.parse once it has been downloaded
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The URL ends in ".../page/<n>/", so the second-to-last segment is the page number
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
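If the split("/")[-2] trick in parse looks cryptic, this small standalone snippet (no Scrapy needed) walks through it for one of the two start URLs:

```python
url = "http://quotes.toscrape.com/page/2/"

# Splitting on "/" leaves an empty string after the trailing slash,
# so the page number is the second-to-last element.
parts = url.split("/")
print(parts)     # ['http:', '', 'quotes.toscrape.com', 'page', '2', '']

page = parts[-2]
filename = "quotes-%s.html" % page
print(filename)  # quotes-2.html
```

Note that this relies on the URL ending with a slash. For a URL without one, the page number would be the last element instead.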

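Running our spider

Go to the project's root directory and run the following command to start the spider:

scrapy crawl quotes

This command starts the spider that crawls quotes.toscrape.com. You will see output similar to this: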

2017-05-10 20:36:17 [scrapy.core.engine] INFO: Spider opened

2017-05-10 20:36:17 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

2017-05-10 20:36:17 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023

2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)

2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)

2017-05-10 20:36:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)

2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-1.html

2017-05-10 20:36:17 [quotes] DEBUG: Saved file quotes-2.html

2017-05-10 20:36:17 [scrapy.core.engine] INFO: Closing spider (finished)

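Extracting data

So far we have only saved the HTML pages without extracting any data. Let's upgrade the code and add extraction. This version also drops start_requests in favor of a start_urls list: Scrapy builds the initial requests from it and uses parse as the default callback. As for how to analyze a page with the browser's developer tools, that was covered earlier in this series.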

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Every quote on the page lives in its own <div class="quote">
        for quote in response.css('div.quote'):
            yield {
                # extract_first() returns the first match (or None)
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                # extract() returns a list of all matches
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
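To see what those three selectors pick out without running a crawl, here is a rough stand-in that uses only the standard library. Treat it as an illustration rather than Scrapy's API: it swaps Scrapy's CSS selectors for xml.etree.ElementTree paths, and the HTML fragment is a hypothetical, simplified version of one quote block on the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like one quote block on the page.
html = """
<body>
  <div class="quote">
    <span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span>
    <span>by <small class="author">André Gide</small></span>
    <div class="tags">
      <a class="tag">life</a>
      <a class="tag">love</a>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(html)
items = []
for quote in root.findall(".//div[@class='quote']"):
    items.append({
        # counterpart of span.text::text
        "text": quote.find("span[@class='text']").text,
        # counterpart of small.author::text (the <small> is nested in a <span>)
        "author": quote.find(".//small[@class='author']").text,
        # counterpart of div.tags a.tag::text, one entry per tag link
        "tags": [a.text for a in quote.findall("div[@class='tags']/a[@class='tag']")],
    })

print(items)
```

In the spider itself you don't need any of this: response.css does the matching, and it copes with real-world HTML that a strict XML parser would reject.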

Run the spider again and you will see the extracted data in the log:

2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}

2017-05-10 20:38:33 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>

{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

Storing the scraped data

The simplest way to store the scraped data is to use Feed exports:

scrapy crawl quotes -o quotes.json

This command serializes the scraped data as JSON and generates a quotes.json file.
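Once you have the file, reading it back is plain json work. The sketch below uses an inline sample instead of a real quotes.json so that it runs on its own. The sample is an assumption based on the two items in the log above, and the exact formatting of the real file may differ.

```python
import json

# Hypothetical sample standing in for the contents of quotes.json.
sample = '''[
{"text": "“It is better to be hated for what you are than to be loved for what you are not.”", "author": "André Gide", "tags": ["life", "love"]},
{"text": "“I have not failed. I've just found 10,000 ways that won't work.”", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]}
]'''

# With a real export you would instead do:
#     with open("quotes.json", encoding="utf-8") as f:
#         quotes = json.load(f)
quotes = json.loads(sample)

# Each entry is the dictionary the spider yielded.
tags_by_author = {q["author"]: q["tags"] for q in quotes}
print(tags_by_author["André Gide"])
```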

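For a small project like the one in this tutorial, that is enough. If you need to do more complex things with the scraped items, you can write an Item Pipeline; tutorial/pipelines.py was created automatically when the project was set up.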
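As a taste of what goes into pipelines.py, here is a minimal sketch of a pipeline. The class name and what it does (stripping the curly quotation marks around each quote) are made up for illustration. The process_item(self, item, spider) method is the hook Scrapy calls for every item.

```python
class StripQuotesPipeline:
    """Hypothetical pipeline: remove the curly quotes wrapping each quote's text."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text:
            # \u201c and \u201d are the opening and closing curly double quotes
            item["text"] = text.strip("\u201c\u201d")
        # Return the item so that any later pipeline keeps processing it
        return item


# Quick check outside of Scrapy; no spider object is needed for this pipeline.
pipeline = StripQuotesPipeline()
item = {
    "text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d",
    "author": "André Gide",
    "tags": ["life", "love"],
}
cleaned = pipeline.process_item(item, spider=None)
print(cleaned["text"])

# To enable it in a project, you would list it in settings.py, for example:
#     ITEM_PIPELINES = {"tutorial.pipelines.StripQuotesPipeline": 300}
```

The number in ITEM_PIPELINES sets the order in which pipelines run, with lower values running first.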

