Python BeautifulSoup, Part 2: Scraping Tags by Attribute Value (find

    xiaoxiao  2023-03-24

    Environment: Windows / Python 2.7.11

    Use the BeautifulSoup library to pull specific tags out of a page, then read the values of particular attributes on those tags.

    Sample markup:

    <cc>             <div id="post_content_79076951035" class="d_post_content j_d_post_content ">            进来呀<br>都是自己喜欢的<br>拿图就走你是狗 <br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=f4a2042b3c87e9504217f3642039531b/55f8e6cd7b899e514d1131fc44a7d933c9950db8.jpg" size="20418" height="852" width="480"> <br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=914d48d14d36acaf59e096f44cd88d03/6a57b319ebc4b745190bbcfec9fc1e178b8215b8.jpg" size="12400" height="600" width="400"> <br><img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=522fecd8bca1cd1105b672288910c8b0/6c318744ebf81a4cfbfce421d12a6059242da60a.jpg" size="21266" height="852" width="479"></div> <br> </cc>
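    To see what the two-step lookup does without touching the network, here is a minimal sketch that parses a trimmed copy of the sample markup from a string. The SAMPLE string, the extra img with class "other", and the helper name extract_image_urls are assumptions added for illustration; they are not part of the original script. The `__future__` import only makes print() behave the same on Python 2.7 and Python 3.

```python
from __future__ import print_function

from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the sample markup. The second <img> (class
# "other") and the second <cc> block are made up to show what gets skipped.
SAMPLE = '''<cc><div id="post_content_79076951035" class="d_post_content j_d_post_content ">
text<br>
<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=f4a2042b3c87e9504217f3642039531b/55f8e6cd7b899e514d1131fc44a7d933c9950db8.jpg" height="852" width="480">
<img class="other" src="http://example.com/ignored.png">
</div></cc>
<cc><div>no images here</div></cc>'''


def extract_image_urls(html):
    """Return the src of every <img class="BDE_Image"> found inside a <cc> tag."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for cc in soup.find_all("cc"):
        # attrs={"class": ...} keeps only tags whose class list contains that value
        for img in cc.find_all("img", attrs={"class": "BDE_Image"}):
            src = img.get("src")  # .get() returns None instead of raising KeyError
            if src:
                urls.append(src)
    return urls


urls = extract_image_urls(SAMPLE)
for u in urls:
    print(u)
```

    Only the BDE_Image URL is collected; the img with a different class and the cc block without images contribute nothing.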

    =================================== Code ===================================

    #coding=utf-8
    import urllib2
    from bs4 import BeautifulSoup

    def getImg(url):
        html = urllib2.urlopen(url)
        page = html.read()
        soup = BeautifulSoup(page, "html.parser")
        # find_all('cc') returns a list of every <cc> tag: [<cc>...</cc>, <cc>...</cc>]
        for s in soup.find_all('cc'):
            # Skip any <cc> block that contains no img tag
            if 'img' not in str(s):
                continue
            # Collect every <img> tag whose class attribute is "BDE_Image"
            d = s.find_all('img', attrs={'class': 'BDE_Image'})
            for img in d:
                # Print the src attribute, i.e. the http://... image URL
                print img.attrs['src']

    url = 'http://tieba.baidu.com/p/4161148236?fr=frs'
    getImg(url)
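    As an alternative to the nested find_all calls in getImg, the same filter can probably be expressed as a single CSS selector with BeautifulSoup's select(). This is only a sketch under assumptions: the HTML string and the example.com URLs below are made up for illustration, and select() needs a reasonably recent bs4 (it relies on the soupsieve package in bs4 4.7 and later).

```python
from __future__ import print_function

from bs4 import BeautifulSoup

# Hypothetical markup: two matching images inside <cc>, one outside any <cc>.
html = '''<cc><div><img class="BDE_Image" src="http://example.com/a.jpg">
<img class="BDE_Image" src="http://example.com/b.jpg"></div></cc>
<cc><div>text only</div></cc>
<img class="BDE_Image" src="http://example.com/outside.jpg">'''

soup = BeautifulSoup(html, "html.parser")

# CSS selector: any <img class="BDE_Image"> that is a descendant of a <cc> tag
selected = [img["src"] for img in soup.select("cc img.BDE_Image")]

# The nested find_all form, for comparison
nested = [
    img.attrs["src"]
    for cc in soup.find_all("cc")
    for img in cc.find_all("img", attrs={"class": "BDE_Image"})
]

print(selected)
print(selected == nested)
```

    Both forms skip the image that sits outside every cc tag. Note that img["src"] and img.attrs["src"] raise KeyError when the attribute is missing, while img.get("src") returns None.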

    ========================================end========================================
