Getting Started with Scrapy

xiaoxiao, 2025-02-18

Environment: Ubuntu 16 + Python 2.7.11 + Scrapy 1.0.3

Installing Scrapy

apt-get install scrapy

Creating a Project

Change into the directory where you want to create the project, for example /home/test.

Run: scrapy startproject tutorial

This creates a tutorial directory under the current directory, with the following structure:

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
                ...

Defining Your Item

Open items.py and add the following:

    import scrapy


    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()

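Defining Your Spider

Go into the spiders directory and create a new file named demo_spider.py, then give it execute permission (optional: Scrapy imports the spider as a module, so scrapy crawl works without it).

Open demo_spider.py and add the following:

    import scrapy


    class DemoSpider(scrapy.Spider):
        name = "demo"
        allowed_domains = ["weather.com.cn"]
        start_urls = [
            "http://www.weather.com.cn/",
        ]

        def parse(self, response):
            # Save the raw response body to a local file.
            filename = 'source_code.html'
            with open(filename, 'wb') as f:
                f.write(response.body)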

demo_spider.py does not use the title, link, and desc fields defined in items.py; it only writes the raw response body (the full HTML source of the page) to source_code.html.

Running the Spider

Go to the project's top-level directory, which here is /home/test/tutorial.

Run: scrapy crawl demo

This writes the page's source code to source_code.html in the current directory.

When crawling some sites, you may see the error: [boto] ERROR: Caught exception reading instance

To get rid of it, add the following to settings.py:

    DOWNLOAD_HANDLERS = {'s3': None}

and add the following to the spider code (here, demo_spider.py):

    from scrapy import optional_features
    optional_features.remove('boto')

That resolves the error.

Scrapy Shell

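To check whether your extraction expressions return the right data, you don't have to run the spider every time; the Scrapy shell is a convenient alternative.

Before using it, configure which shell Scrapy launches: in the [settings] section of scrapy.cfg, add shell = ipython. You can also use bpython.

Go to the top-level directory of the project (/home/test/tutorial) and run: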

scrapy shell "http://www.weather.com.cn/"

This drops you into IPython with a response variable that holds the fetched page. You can view the source of http://www.weather.com.cn/ in Firefox and then use XPath to select the content you want to scrape.

For example, to extract the page title:

    In [2]: response.xpath('//title/text()').extract()
    Out[2]: [u'\u4e2d\u56fd\u5929\u6c14\u7f51-\u4e13\u4e1a\u5929\u6c14\u9884\u62a5\u3001\u6c14\u8c61\u670d\u52a1\u95e8\u6237']

To extract the href values of the page's link tags:

    In [3]: response.xpath('//link/@href').extract()
    Out[3]: [u'http://i.tq121.com.cn', u'http://i.tq121.com.cn/c/weather2014/x-weather.css']

When reposting, please credit the original URL: https://ju.6miu.com/read-1296577.html
