Python Basics Study Notes 34: A Small Web Crawler Example
Founder
2024-03-29 15:14:09

A small web crawler example:

<pre><code class="language-python">import json
import re

import requests


def getPage(url):
    # Fetch the page and return its body as text.
    response = requests.get(url)
    return response.text


def parsePage(s):
    # re.S lets "." match newlines, so the pattern can span several lines of HTML.
    # "评价" is the literal text that follows the rating count on the page.
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    print(ret)  # prints the generator object itself, not the parsed records
    # Append one JSON object per line.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")


if __name__ == '__main__':
    count = 0
    for i in range(10):
        main(count)
        count += 25
</code></pre>
<p>However, when I ran this example it produced no results, so I stepped through it in the debugger:</p>
<p><img alt="" height="202" src="https://img.pic99.top/linuxoffice369/202403/cda24c00261564b.png" width="1031" /></p>
<p>The response status code was 418. A Baidu search suggests that this is returned by the site's anti-crawler mechanism, so the program needs to be modified to send browser-like request headers:</p>
<pre><code class="language-python">import re
import urllib.request


def getPage(url):
    # Send browser-like headers; without them the site answered with status 418.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
        'Referer': 'https://movie.douban.com/',
        'Connection': 'keep-alive',
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    # The body arrives as bytes; decode it to get the string to match against.
    html = response.read().decode('utf-8')
    return html


def parsePage(s):
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    for obj in ret:
        print(str(obj))


# The list has 250 entries, 25 per page, so request 10 pages.
count = 0
for i in range(10):
    main(count)
    count += 25
</code></pre>
<p>Let's take a closer look at the regular expression:</p>
<pre><code 
class="language-python">'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>''.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>'</code></pre>
<p>The code above is the regular expression.</p>
<p>1) .*? : This is a lazy (non-greedy) match. It consumes as few characters as possible and stops at the first position where the next part of the pattern, here the following tag, can match.</p>
<p><img alt="" height="652" src="https://img.pic99.top/linuxoffice369/202403/9983be6605b8412.png" width="1200" /></p>
<p>2) Next, look at this part of the pattern:</p>
<p>(?P<id>\d+): This is a group. Inside it, \d matches a digit, and the trailing plus sign means one or more digits.</p>
<p>The leading ?P<id> gives the group a name, so its value can be retrieved by name with group("id"), or by position with group(n).</p>
<p>3) compile returns com, a compiled regular expression object. We then call finditer on that object; because there are many matches, we use an iterator instead of building the full list at once.</p>
<p>4) We then return the groups of each match, each under its own name. We use yield rather than return, which makes the function a generator. Records are produced one at a time as the caller consumes them, so they do not all occupy memory at once.</p>
<p>5) Now look at the crawler's overall flow:</p>
<p>1. Download the page source from the URL;</p>
<p>2. Decode the bytes as UTF-8; the resulting page text is the string we match against;</p>
<p>3. ret is an iterator over all the matches (not a list; each item is produced only when it is requested);</p>
<p>6) Work through the pattern until the regular expression usage is clear.</p>
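<p>The named groups, finditer, and yield discussed in these notes can be tried without any network access. The following is only a minimal sketch: SAMPLE_HTML is a made-up fragment that merely imitates the structure the pattern expects (it is not the real page markup), and parse_page and PATTERN are illustrative names rather than part of the original program.</p>

```python
import re

# Hypothetical HTML fragment (an assumption for illustration, not real page markup).
SAMPLE_HTML = '''
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>2000人评价</span>
</div>
'''

# Same pattern as in the notes, written as raw strings so that \d is not
# treated as a string escape. re.S lets "." match newlines.
PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+)'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,
)


def parse_page(html):
    """Yield one dict per match; nothing is parsed until the caller iterates."""
    for match in PATTERN.finditer(html):
        # groupdict() maps every named group to the text it captured.
        yield match.groupdict()


if __name__ == "__main__":
    for record in parse_page(SAMPLE_HTML):
        print(record)
```

<p>Because parse_page is a generator, calling it returns immediately; the regular expression is applied only as the for loop asks for each record. Using groupdict() here is a design choice that avoids listing every group name by hand, and it is equivalent to the explicit group("id") calls in the notes.</p>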