Python Basics Study Notes 34: A Small Web Crawler Example
Founder
2024-03-29 15:14:09

A small crawler example:

<pre><code class="language-python">import requests
import re
import json


def getPage(url):
    # Fetch the page with requests' default headers (no custom User-Agent).
    response = requests.get(url)
    return response.text


def parsePage(s):
    # re.S lets "." match newlines, so the pattern can span several lines of HTML.
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    print(ret)  # prints the generator object itself, not the results
    # Append one JSON object per line; the with block closes the file.
    with open("move_info7", "a", encoding="utf8") as f:
        for obj in ret:
            print(obj)
            data = json.dumps(obj, ensure_ascii=False)
            f.write(data + "\n")


if __name__ == '__main__':
    count = 0
    for i in range(10):  # 10 pages, 25 entries per page
        main(count)
        count += 25</code></pre>
<p>However, when I ran this example it produced no results, so I stepped through it in a debugger:</p>
<p><img alt="" height="202" src="https://img.pic99.top/linuxoffice369/202403/cda24c00261564b.png" width="1031" /></p>
<p>The response's status code was 418. A quick search on Baidu suggests that this is returned by the site's anti-crawler mechanism, so the program needs to be modified:</p>
<pre><code class="language-python">import urllib.request
import re


def getPage(url):
    # Browser-like headers; the request with default headers was answered with 418.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1;WOW64) AppleWebKit/537.36 (KHTML,like GeCKO) '
                      'Chrome/45.0.2454.85 Safari/537.36 115Broswer/6.0.3',
        'Referer': 'https://movie.douban.com/',
        'Connection': 'keep-alive',
    }
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    # The body arrives as bytes; decode it to get the string to match against.
    html = response.read().decode('utf-8')
    return html


def parsePage(s):
    com = re.compile(
        r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
        r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }


def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    for obj in ret:
        print(obj)


if __name__ == '__main__':
    count = 0
    for i in range(10):
        main(count)
        count += 25</code></pre>
<p>Now let's take a closer look at the regular expression:</p>
<pre><code
class="language-python">r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?<span class="title">(?P<title>.*?)</span>'
r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?<span>(?P<comment_num>.*?)评价</span>'</code></pre>
<p>The snippet above is the regular expression used by the crawler.</p>
<p>1) .*? is a lazy (non-greedy) match: it consumes as few characters as possible and stops as soon as the part of the pattern that follows it (here, the next tag) can match.</p>
<p><img alt="" height="652" src="https://img.pic99.top/linuxoffice369/202403/9983be6605b8412.png" width="1200" /></p>
<p>2) Next, look at this part of the expression:</p>
<p>(?P<id>\d+) is a capturing group. Inside it, \d matches a single digit, and the trailing + means one or more digits.</p>
<p>The leading ?P<id> gives the group the name id, so the captured value can be read with group("id"), or by position with group(1).</p>
<p>3) re.compile returns com, a compiled regular expression object. We then call its finditer method; because the page contains many matches, we iterate over them instead of building the full result list at once.</p>
<p>4) Each match is then returned as a dict of its named groups. The function uses yield rather than return, which makes it a generator: results are produced one at a time as the caller consumes them, so they never all sit in memory at once.</p>
<p>5) The overall flow of the crawler is:</p>
<p>1. Fetch the page source from the URL;</p>
<p>2. Decode the response bytes as UTF-8; the resulting page text is the string to be matched;</p>
<p>3. ret yields every match found on the page. Note that it is an iterator (a generator in main), not a list, so it can be traversed only once.</p>
<p>6) Take the time to understand how the regular expression is used.</p>
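<p>To make the points about lazy matching, named groups, finditer and yield concrete, here is a minimal sketch that runs the article's pattern against a small hand-written HTML fragment instead of the live site. The names SAMPLE_HTML, PATTERN and parse_page, as well as the movie titles and numbers, are made up for illustration; the markup of the real page may differ.</p>

```python
import re

# Hypothetical, simplified markup: it only imitates the structure that the
# article's pattern expects and is not copied from the real site.
SAMPLE_HTML = """
<div class="item">
  <div class="pic"><em class="">1</em></div>
  <span class="title">Movie A</span>
  <span class="rating_num" property="v:average">9.7</span>
  <span>1000人评价</span>
</div>
<div class="item">
  <div class="pic"><em class="">2</em></div>
  <span class="title">Movie B</span>
  <span class="rating_num" property="v:average">9.6</span>
  <span>2000人评价</span>
</div>
"""

# Same pattern as in the article, written as raw strings.
PATTERN = re.compile(
    r'<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+)'
    r'.*?<span class="title">(?P<title>.*?)</span>'
    r'.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>'
    r'.*?<span>(?P<comment_num>.*?)评价</span>',
    re.S,  # let "." match newlines so one match can span several lines
)


def parse_page(html):
    """Yield one dict of named groups per match, lazily."""
    for match in PATTERN.finditer(html):
        yield match.groupdict()


# Lazy vs. greedy on a tiny string.
text = "<b>one</b><b>two</b>"
lazy = re.findall(r"<b>(.*?)</b>", text)   # stops at the first </b>
greedy = re.findall(r"<b>(.*)</b>", text)  # runs on to the last </b>

rows = parse_page(SAMPLE_HTML)  # a generator; nothing has been matched yet
first = next(rows)              # the first match is produced on demand
rest = list(rows)               # consuming the generator yields the remainder

print(lazy, greedy)
print(first["id"], first["title"], first["rating_num"], first["comment_num"])
print(rest)
```

<p>Because parse_page is a generator, it can be traversed only once; wrap the call in list(...) if the results are needed more than once.</p>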