170311 Python: Steam Game Ranking Scraper


1625-5 Wang Zi'ang (王子昂), summary for March 11, 2017 [161st consecutive daily summary]

A. Python scraper

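B. With the earlier scraper I wanted to collect not only each game's name but also its tags and price. After many attempts I could scrape the tags on their own, but as soon as everything went into a single regular expression, nothing matched. I studied it for a long time without working out what was wrong.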

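In the end, when I copied the whole passage out of the page source, I noticed a long run of whitespace in the middle, and that the passage was split across two lines. In other words, there was a newline in between, and I had just read that '.' in a regular expression matches any character except a newline. I added an explicit match for the newline at that point, and it finally worked.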
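A minimal sketch of the '.' behaviour, written for Python 3 (the scraper itself is Python 2). The HTML fragment is a made-up stand-in that only borrows the class names from the scraper's regex; the real Steam markup is laid out differently. It shows the pattern failing across a line break, then two ways around it: matching `\n` explicitly, as described above, or passing `re.DOTALL` so that '.' also matches newlines.

```python
import re

# Hypothetical two-line fragment; only the class names come from the scraper's regex.
html = ('<div class="tab_item_name">Some Game</div>   \n'
        '   <div class="tab_item_top_tags"><span class="top_tag">Action</span></div>')

# '.' does not match '\n' by default, so this pattern cannot span both lines.
no_newline = re.search(r'tab_item_name">([^<]*)</div>.+?top_tag">([^<]*)<', html)

# Fix described above: match the newline explicitly.
explicit = re.search(r'tab_item_name">([^<]*)</div>.*?\n.*?top_tag">([^<]*)<', html)

# Alternative: re.DOTALL makes '.' match newlines as well.
dotall = re.search(r'tab_item_name">([^<]*)</div>.+?top_tag">([^<]*)<', html, re.DOTALL)

print(no_newline)         # no match object
print(explicit.groups())  # name and first tag
print(dotall.groups())    # same two values
```

The explicit `\n` version breaks if the page changes how many line breaks sit between the fields, while `re.DOTALL` does not care; on the other hand, `re.DOTALL` with a lazy `.+?` can run past the intended item if a field is missing, so either choice is a trade-off.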

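Next I wanted to match the tags. The number of tags varies from game to game, so my original plan was to use a repeated capturing group. It turns out, though, that each parenthesized group holds only one value: when the group is repeated, every later capture overwrites the previous one, so only the last repetition survives.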
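The overwriting is easy to reproduce. The sketch below is Python 3 and uses a hypothetical tag block modelled on the class names in the scraper's regex, with three invented tags.

```python
import re

# Hypothetical tag block; the tag names are invented for illustration.
block = ('<div class="tab_item_top_tags">'
         '<span class="top_tag">Action</span>'
         '<span class="top_tag">, RPG</span>'
         '<span class="top_tag">, Indie</span>'
         '</div>')

# The capturing group sits inside a repeated (?:...)+, so it takes part in
# three repetitions, but the match object keeps only the last capture.
m = re.search(r'(?:<span class="top_tag">([^<]+)</span>)+', block)

print(m.group(0).count('</span>'))  # how many spans the whole match covered
print(m.group(1))                   # only the final repetition is kept
```

Here the whole match covers all three spans, yet `m.group(1)` holds just `', Indie'`.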

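I was stuck on this for a long time, until a friend suggested an unconventional workaround: capture the whole repeated tag section in one go, then separate it with the string split method. Because the target is a repeated group, the same markup sits between every pair of tags, so split reliably yields the pieces I need.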
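A sketch of that capture-then-split approach in Python 3, again on a hypothetical tag block with invented tag names. The separator string is the one the scraper splits on. The last two lines show a different technique, a second `re.findall` over the same block, purely for comparison; it is not what the scraper does.

```python
import re

# Hypothetical tag block; the tag names are invented for illustration.
block = ('<div class="tab_item_top_tags">'
         '<span class="top_tag">Action</span>'
         '<span class="top_tag">, RPG</span>'
         '<span class="top_tag">, Indie</span>'
         '</div>')

# Step 1: a single group captures everything from the first tag up to the
# closing </span></div>, including the markup between tags.
m = re.search(
    r'<div class="tab_item_top_tags"><span class="top_tag">(.+?)</span></div>',
    block)
raw = m.group(1)

# Step 2: split on the markup that separates consecutive tags.
tags = raw.split('</span><span class="top_tag">, ')
print(tags)

# Alternative (not the scraper's method): one findall per tag over the block.
tags_alt = re.findall(r'<span class="top_tag">(?:, )?([^<]+)</span>', block)
print(tags_alt)
```

The split version depends on the separator markup being byte-for-byte identical between tags; the `findall` version tolerates the leading `', '` being absent but still assumes the tag text contains no `<`.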

So far I have finished the code that scrapes the games, prices, and tags from the Steam game ranking:

```python
#encoding:utf-8
# Python 2 code (urllib2 and the print statement).
import urllib2
import re

i = 0
#url='http://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998,996&special_categories=&filter=topsellers&page=1'
url = 'http://store.steampowered.com/games/#tab=TopSellers'
# url='http://store.steampowered.com/search/?sort_by=&sort_order=0&category1=998,996&special_categories=&filter=topsellers&page='+str(i)
request = urllib2.Request(url)
response = urllib2.urlopen(request)
data = response.read()

# Group 1: price, group 2: game name, group 3: the whole tag section.
# '.' does not match a newline, so each line break is matched explicitly with \n.
reg = r'discount_final_price">(¥ \d+).+?\n.+?<div class="tab_item_name">([^<]*)</div>.+?\n.+?\n.+?' + \
      r'<div class="tab_item_top_tags">(?:<span class="top_tag">(.+?)</span>)</div>'
# r'(<span class="top_tag">.+?</span>){3}'
imgre = re.compile(reg)
imglist = imgre.findall(data)
print imglist
for pro in imglist:
    # Print (price, name), then the tags on the same line.
    print pro[:-1],
    # Split the captured tag section on the markup between consecutive tags.
    tag = str(pro[2]).split('</span><span class="top_tag">, ')
    print tag
```

C. Plan for tomorrow: a Python scraper for the discounted games on my wishlist

