Integrating Scrapy with HBase: crawling data and storing it in HBase

xiaoxiao2021-03-25 105

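I have read a lot of posts online about integrating Scrapy with MongoDB, MySQL, Redis and so on, but I could not find one about integrating Scrapy with HBase. Today I tried it out myself and wrote up the result in this post to share with everyone.

There are plenty of examples of crawling data with Scrapy online, so I won't repeat them here.

This post focuses only on how to get Scrapy to write its items into HBase.

HBase is accessed through HappyBase throughout.

HappyBase is a Python library for working with HBase. It is built on Python Thrift, but it is much simpler and more concise to use than Thrift itself, and it is widely used.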

1. Install happybase

    pip install happybase

2. Start the HBase Thrift service

    nohup hbase thrift -p 9090 start &
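The pipeline shown later writes into an existing table with a column family named cf1, so the table has to be created before the first crawl. Once the Thrift service is up, this can be done from Python. The following is only a minimal sketch under a few assumptions: the Thrift server listens on port 9090 as started above, and `connect` and `ensure_table` are hypothetical helper names that do not appear in the original post.

```python
def connect(host, port=9090):
    """Open a HappyBase connection to the HBase Thrift server."""
    import happybase  # imported lazily so ensure_table() works without it
    return happybase.Connection(host, port=port)


def ensure_table(connection, table_name, families=("cf1",)):
    """Create table_name with the given column families if it is missing.

    Returns True if the table was created, False if it already existed.
    """
    # HappyBase returns table names as bytes on Python 3, so normalize them.
    existing = {
        name.decode("utf-8") if isinstance(name, bytes) else name
        for name in connection.tables()
    }
    if table_name in existing:
        return False
    # An empty dict per family means "use HBase's default family options".
    connection.create_table(table_name, {family: dict() for family in families})
    return True
```

With the values used in this post, the call would look like `ensure_table(connect('192.168.22.15'), 'novel')`.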

3. In the Scrapy project's settings.py, define HBASE_HOST and HBASE_TABLE

    HBASE_HOST = '192.168.22.15'
    HBASE_TABLE = 'novel'

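4. Write the HBase pipeline in pipelines.py

    from hashlib import md5

    import happybase
    from scrapy.utils.project import get_project_settings

    # Load the project settings so HBASE_HOST and HBASE_TABLE are available.
    settings = get_project_settings()


    class NovelHBasePipeline(object):
        def __init__(self):
            host = settings['HBASE_HOST']
            table_name = settings['HBASE_TABLE']
            # HappyBase connects to the HBase Thrift service started in step 2.
            connection = happybase.Connection(host)
            self.table = connection.table(table_name)

        def process_item(self, item, spider):
            bookName = item['bookName']
            bookTitle = item['bookTitle']
            chapterURL = item['chapterURL']
            # Row key: MD5 of book name + chapter title (md5 needs bytes on Python 3).
            row_key = md5((bookName + bookTitle).encode('utf-8')).hexdigest()
            self.table.put(row_key,
                           {'cf1:bookname': bookName,
                            'cf1:booktitle': bookTitle,
                            'cf1:chapterurl': chapterURL})
            return item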
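The pipeline above issues one put per item, which is fine for small crawls. For larger ones, HappyBase's batch interface can buffer writes and send them together. Below is a rough sketch of such a variant, not the method used in this post: `BatchedHBasePipeline` and the `HBASE_BATCH_SIZE` setting are hypothetical names introduced here, and the sketch assumes the same item fields and the cf1 column family as above. It also reads the settings through `from_crawler` and closes the connection when the spider finishes.

```python
import hashlib


class BatchedHBasePipeline(object):
    """Variant of the HBase pipeline that buffers puts and sends them in batches."""

    def __init__(self, host, table_name, batch_size=100):
        self.host = host
        self.table_name = table_name
        self.batch_size = batch_size
        self.connection = None
        self.batch = None

    @classmethod
    def from_crawler(cls, crawler):
        # Read the HBASE_HOST / HBASE_TABLE settings defined in settings.py.
        # HBASE_BATCH_SIZE is a hypothetical extra setting (assumption).
        s = crawler.settings
        return cls(s.get("HBASE_HOST"), s.get("HBASE_TABLE"),
                   int(s.get("HBASE_BATCH_SIZE", 100)))

    def open_spider(self, spider):
        import happybase  # lazy import: only needed once a crawl starts
        self.connection = happybase.Connection(self.host)
        table = self.connection.table(self.table_name)
        # The batch sends itself once batch_size mutations have accumulated.
        self.batch = table.batch(batch_size=self.batch_size)

    def close_spider(self, spider):
        # Flush whatever is still buffered, then release the Thrift connection.
        if self.batch is not None:
            self.batch.send()
        if self.connection is not None:
            self.connection.close()

    def process_item(self, item, spider):
        # Same row key scheme as the original pipeline: MD5 of name + title.
        key = hashlib.md5(
            (item["bookName"] + item["bookTitle"]).encode("utf-8")
        ).hexdigest()
        self.batch.put(key, {
            "cf1:bookname": item["bookName"],
            "cf1:booktitle": item["bookTitle"],
            "cf1:chapterurl": item["chapterURL"],
        })
        return item
```

To try it, this class would be registered in ITEM_PIPELINES in place of NovelHBasePipeline. Because the row key is a hash of the book name and chapter title, re-crawling the same chapter overwrites the existing row rather than adding a duplicate.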

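5. Register the new pipeline in settings.py

    ITEM_PIPELINES = {
        'novelspider.pipelines.NovelspiderPipeline': 500,
        'novelspider.pipelines.NovelHBasePipeline': 1
    }

Pipelines with lower values run first, so here the HBase pipeline runs before NovelspiderPipeline.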

That completes the integration. You can now run your spider to crawl data and store it in HBase.
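After a crawl, one quick way to confirm that rows actually landed in HBase is to scan a few of them back through HappyBase. The snippet below is a small sketch under the same assumptions as the rest of this post (host 192.168.22.15, table novel). `preview_rows` and `main` are hypothetical helper names, not part of the original project.

```python
def preview_rows(table, limit=5):
    """Return up to `limit` rows from an HBase table as plain str dicts."""
    def to_text(value):
        # HappyBase hands back bytes on Python 3; decode them for readability.
        return value.decode("utf-8") if isinstance(value, bytes) else value

    rows = []
    for row_key, columns in table.scan(limit=limit):
        rows.append((to_text(row_key),
                     {to_text(k): to_text(v) for k, v in columns.items()}))
    return rows


def main(host="192.168.22.15", table_name="novel"):
    import happybase  # lazy import so preview_rows stays usable without it
    connection = happybase.Connection(host)
    try:
        for row_key, columns in preview_rows(connection.table(table_name)):
            print(row_key, columns)
    finally:
        connection.close()
```

Calling `main()` prints the first few row keys together with their cf1 columns. An empty result usually means the pipeline was not enabled or the Thrift service was not reachable during the crawl.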

When reposting, please cite the original URL: https://ju.6miu.com/read-23992.html
