A Python 3.x Baidu Tieba image scraper (plus a Zhihu image scraper)

    xiaoxiao · 2021-03-25

    Most of the tutorials I could find were for Python 2, so I rolled my own Python 3 version.

    Python 3.5, edited in PyCharm, using urllib and regular expressions. The code is at https://github.com/BladeXunGe/test

    1 Tieba image scraper

    A simple skeleton:

    # -*- coding: utf-8 -*-
    import urllib.request
    import re

    def download_page(url):
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        data = response.read()
        return data

    def get_image(html):
        regx = r'src="(https://img.*?\.jpg)"'
        pattern = re.compile(regx)
        imlist = re.findall(pattern, repr(html))
        num = 1
        for img in imlist:
            image = download_page(img)
            with open('%s.jpg' % num, 'wb') as fp:
                fp.write(image)
            print('downloading pic %s' % num)
            num += 1

    url = 'https://tieba.baidu.com/p/1181591427'
    html = download_page(url)
    get_image(html)
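One thing worth noting: download_page() returns bytes, and running the regex over repr(html) only works by accident. A cleaner approach (my suggestion, not from the original) is to decode the bytes first; the snippet of page HTML below is made up for illustration:

```python
import re

# Hypothetical fragment of a decoded Tieba page body.
html_bytes = b'<img class="BDE_Image" src="https://imgsa.baidu.com/forum/pic/item/abc123.jpg" width="560">'

# Decode bytes to str before matching, instead of regexing over repr(bytes).
html = html_bytes.decode('utf-8', errors='ignore')
imlist = re.findall(r'src="(https://img.*?\.jpg)"', html)
print(imlist)  # ['https://imgsa.baidu.com/forum/pic/item/abc123.jpg']
```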

    This only scrapes a single page.
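Tieba paginates a thread with the pn query parameter, so covering more pages just means appending page numbers to the base URL (the page range here is illustrative):

```python
# Build one URL per page of the thread by appending the pn page number.
base = 'https://tieba.baidu.com/p/1181591427?pn='
pages = [base + str(k) for k in range(1, 4)]
print(pages[0])  # https://tieba.baidu.com/p/1181591427?pn=1
```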

    So I adjusted it:

    # -*- coding: utf-8 -*-
    import urllib.request
    import re

    def download_page(url):
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        data = response.read()
        return data

    def get_image(html, x):
        regx = r'"(\S*?\.jpg)" pic_ext="jpeg"'
        pattern = re.compile(regx)
        imlist = re.findall(pattern, repr(html))
        print(imlist)
        for i in imlist:
            print(i)
            print(x)
            urllib.request.urlretrieve(i, '%s.jpg' % x)
            x += 1
        return x

    x = 1
    url = 'https://tieba.baidu.com/p/1181591427?pn='
    for k in range(1, 28):
        ul = url + str(k)
        print(ul)
        html = download_page(ul)
        x = get_image(html, x)

    The two versions use two different ways of saving to disk.
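Those two save styles (urlretrieve in one call vs. fetching the bytes and writing them under "with open") can be sketched side by side. To keep the sketch runnable offline, a local file: URL stands in for a real image URL; real code would use the https:// links from the regex:

```python
import os
import tempfile
import urllib.request
from urllib.request import pathname2url

# Fake "remote" image as a local file, so the sketch runs offline.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'src.jpg')
with open(src, 'wb') as fp:
    fp.write(b'fake-jpeg-bytes')
url = 'file:' + pathname2url(src)

# Style 1: urlretrieve fetches the URL and writes the file in one call.
urllib.request.urlretrieve(url, os.path.join(tmp, 'a.jpg'))

# Style 2: read the bytes yourself, then write them under "with open",
# which guarantees the file handle is closed even if the write raises.
data = urllib.request.urlopen(url).read()
with open(os.path.join(tmp, 'b.jpg'), 'wb') as fp:
    fp.write(data)
```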

    Later, to add a save-path option and some interactivity, I reworked the storage code:

    # -*- coding: utf-8 -*-
    import urllib.request
    import re
    import os

    def download_page(url):
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        data = response.read()
        return data

    path = input('please enter your place like D:/imags/')
    if not os.path.exists(path):
        os.mkdir(path)

    def get_image(html, x):
        regx = r'src="(https://img.*?\.jpg)"'
        pattern = re.compile(regx)
        imlist = re.findall(pattern, repr(html))
        print(imlist)
        for img in imlist:
            image = download_page(img)
            name = '%s.jpg' % x
            with open(path + name, 'wb') as fp:
                fp.write(image)
            x += 1
            print('downloading pic %s' % x)
        return x

    x = 1
    url = input('please enter your url like https://tieba.baidu.com/p/1181591427?pn=')
    for k in range(1, 28):
        ul = url + str(k)
        print(ul)
        html = download_page(ul)
        x = get_image(html, x)

    Multithreading still needs to be added; I'll leave that for later.
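A minimal sketch of the multithreading idea using concurrent.futures. The downloader here is a stand-in that returns fake bytes so the sketch runs offline, and the URLs are made up; real code would drop in the urllib download_page from above:

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def download_page(url):
    # Stand-in for the real urllib download, so the sketch runs offline.
    return b'fake-jpeg-bytes'

def save_image(args):
    # Each worker downloads one image and writes it to the save path.
    num, img_url = args
    image = download_page(img_url)
    name = os.path.join(path, '%s.jpg' % num)
    with open(name, 'wb') as fp:
        fp.write(image)
    return name

path = tempfile.mkdtemp()
imlist = ['https://example.com/%d.jpg' % i for i in range(5)]  # hypothetical URLs

# Download several images concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    names = list(pool.map(save_image, enumerate(imlist, start=1)))

print(len(names))  # 5
```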

    2 Zhihu image scraper

    Attached is the Zhihu image scraper; the functionality is incomplete, but it's just about usable.

    # -*- coding: utf-8 -*-
    import urllib.request
    import re
    import os

    def download_page(url):
        request = urllib.request.Request(url)
        response = urllib.request.urlopen(request)
        data = response.read()
        return data

    path = input('please enter your place like D:/imags/')
    if not os.path.exists(path):
        os.mkdir(path)

    def get_image(html):
        regx = r'img src="(http.*?)"'
        pattern = re.compile(regx)
        imlist = re.findall(pattern, repr(html))
        num = 1
        for img in imlist:
            image = download_page(img)
            name = '%s.jpg' % num
            with open(path + name, 'wb') as fp:
                fp.write(image)
            print('downloading pic %s' % num)
            num += 1

    url = 'https://www.zhihu.com/question/34378366'
    html = download_page(url)
    get_image(html)

    There is no interactive URL input here yet; to be fixed later.

    3 Some thoughts and reminders

    1 The hard part of a scraper like this is the regular expression.
    2 Some syntax changed between Python versions:
      (1) urllib2 was folded into urllib; reference it as urllib.request.
      (2) The usage of re.findall.
    3 On writing files: a bare f = open() is risky, so it was replaced with "with open"; urllib.request.urlretrieve() is another option.
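On point 2 (2): with a single capture group, re.findall returns just the captured text rather than the whole match, which is exactly what the scrapers above rely on. The sample HTML here is made up:

```python
import re

# Two fake image tags; findall returns only the group inside the parentheses.
html = '<img src="https://img.example.com/1.jpg"> <img src="https://img.example.com/2.jpg">'
pics = re.findall(r'src="(https://img.*?\.jpg)"', html)
print(pics)  # ['https://img.example.com/1.jpg', 'https://img.example.com/2.jpg']
```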

    Please credit the original when reposting: https://ju.6miu.com/read-17560.html
