Python Web Scraping Basics: BeautifulSoup
Founder
2024-04-01 07:07:50

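  • 1. BeautifulSoup Basics
    • 1.1 CSS and common front-end tags and attribute values
    • 1.2 HTML parsing
      • 1.2.1 BeautifulSoup's find() and find_all() functions
      • 1.2.2 Getting a tag's children, siblings, and parents
        • 1.2.2.1 Children and other descendants
        • 1.2.2.2 Siblings
        • 1.2.2.3 Parents
    • 1.3 Regular expressions and BeautifulSoup
    • 1.4 Accessing attributes
    • 1.5 Using lambda functions
    • 1.6 Scraping a page file and downloading it locally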

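1. BeautifulSoup Basics

1.1 CSS and common front-end tags and attribute values

  • Cascading Style Sheets (CSS): CSS is a language for defining presentation such as fonts, colors, and positioning; it describes how the information on a web page is formatted and displayed;
  • For common front-end tags and attribute values, see: https://www.cnblogs.com/blknemo/p/10553021.html

1.2 HTML parsing

  • On the page http://www.pythonscraping.com/pages/warandpeace.html, the characters' dialogue is red and the characters' names are green.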
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://www.pythonscraping.com/pages/warandpeace.html')
bsObj = BeautifulSoup(html, "html.parser")  
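# Without the "html.parser" argument, BeautifulSoup emits a GuessedAtParserWarning because no parser was specified explicitly
nameList = bsObj.find_all('span', {'class': 'green'})
# With the BeautifulSoup object, find_all extracts only the text inside <span class="green"></span> tags, which gives a Python list of character names
print(nameList)
# Result (each item prints as a <span class="green"> tag; only the text is shown here)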
"""
[Anna
Pavlovna Scherer, Empress Marya
Fedorovna, Prince Vasili Kuragin, Anna Pavlovna, St. Petersburg, the prince, Anna Pavlovna, Anna Pavlovna, the prince, the prince, the prince, Prince Vasili, Anna Pavlovna, Anna Pavlovna, the prince, Wintzingerode, King of Prussia, le Vicomte de Mortemart, Montmorencys, Rohans, Abbe Morio, the Emperor, the prince, Prince Vasili, Dowager Empress Marya Fedorovna, the baron, Anna Pavlovna, the Empress, the Empress, Anna Pavlovna's, Her Majesty, Baron
Funke, The prince, Anna
Pavlovna, the Empress, The prince, Anatole, the prince, The prince, Anna
Pavlovna, Anna Pavlovna]
"""
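# .get_text() strips every tag from the HTML you are working with and returns a string containing only the text.
# If you are working with a large block of source code full of hyperlinks, paragraphs, and other tags, .get_text() removes them all and leaves a single tag-free string.
for name in nameList:
    print(name.get_text())  # or print(name.text)
# Result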
"""
Anna Pavlovna Scherer
Empress Marya Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
Anna Pavlovna
the prince
Wintzingerode
King of Prussia
le Vicomte de Mortemart
Montmorencys
Rohans
Abbe Morio
the Emperor
the prince
Prince Vasili
Dowager Empress Marya Fedorovna
the baron
Anna Pavlovna
the Empress
the Empress
Anna Pavlovna's
Her Majesty
Baron Funke
The prince
Anna Pavlovna
the Empress
The prince
Anatole
the prince
The prince
Anna Pavlovna
Anna Pavlovna
"""

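As a minimal sketch of what .get_text() does with nested markup, the snippet below parses a small hand-written HTML string instead of the live page. The markup and the variable names snippet and plain are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not taken from the War and Peace page.
snippet = '<p>Said <span class="green">the <b>prince</b></span> to <a href="#">Anna</a>.</p>'
bsObj = BeautifulSoup(snippet, "html.parser")

# get_text() drops the <span>, <b>, and <a> tags and joins the remaining text.
plain = bsObj.p.get_text()
print(plain)  # Said the prince to Anna.
```

Because .get_text() discards the tag structure, it is usually best called as the last step, after all find() and find_all() filtering is done.
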
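1.2.1 BeautifulSoup's find() and find_all() functions

  • BeautifulSoup's find() and find_all() functions filter an HTML page by a tag's various attributes, which makes it easy to find a group of tags or a single tag;
  • Function parameters:
    find_all(tag, attributes, recursive, text, limit, keywords)
    find(tag, attributes, recursive, text, keywords)
# Get a list of all heading tags in the HTML document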
bsObj.find_all({"h1","h2","h3","h4","h5","h6"})
# Result
"""
[War and Peace, Chapter 1]
"""
# Return the red and green span tags in the HTML document:
bsObj.find_all("span", {"class":{"green", "red"}})
# Result
"""
[Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don't tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist- I really believe he is Antichrist- I will have nothing more to do with you and you are no longer my friend, no longer my 'faithful slave,' as you call yourself! But how do you do? I see I have frightened you- sit down and tell me all the news., Anna Pavlovna Scherer, Empress Marya Fedorovna, Prince Vasili Kuragin, Anna Pavlovna, St. Petersburg, If you have nothing better to do, Count [or Prince], and if the prospect of spending an evening with a poor invalid is not too terrible, I shall be very charmed to see you tonight between 7 and 10- Annette Scherer., Heavens! what a virulent attack!, the prince, Anna Pavlovna, First of all, dear friend, tell me how you are. Set your friend's mind at rest,, Can one be well while suffering morally? Can one be calm in times like these if one has any feeling?, Anna Pavlovna, You are staying the whole evening, I hope?, And the fete at the English ambassador's? Today is Wednesday. I must put in an appearance there,, the prince, My daughter is coming for me to take me there., I thought today's fete had been canceled. I confess all these festivities and fireworks are becoming wearisome., If they had known that you wished it, the entertainment would have been put off,, the prince, Don't tease! Well, and what has been decided about Novosiltsev's dispatch? You know everything., What can one say about it?, the prince, What has been decided? They have decided that Buonaparte has burnt his boats, and I believe that we are ready to burn ours., Prince Vasili, Anna Pavlovna, Anna Pavlovna, Oh, don't speak to me of Austria. Perhaps I don't understand things, but Austria never has wished, and does not wish, for war. 
She is betraying us! Russia alone must save Europe. Our gracious sovereign recognizes his high vocation and will be true to it. That is the one thing I have faith in! Our good and wonderful sovereign has to perform the noblest role on earth, and he is so virtuous and noble that God will not forsake him. He will fulfill his vocation and crush the hydra of revolution, which has become more terrible than ever in the person of this murderer and villain! We alone must avenge the blood of the just one.... Whom, I ask you, can we rely on?... England with her commercial spirit will not and cannot understand the Emperor Alexander's loftiness of soul. She has refused to evacuate Malta. She wanted to find, and still seeks, some secret motive in our actions. What answer did Novosiltsev get? None. The English have not understood and cannot understand the self-abnegation of our Emperor who wants nothing for himself, but only desires the good of mankind. And what have they promised? Nothing! And what little they have promised they will not perform! Prussia has always declared that Buonaparte is invincible, and that all Europe is powerless before him.... And I don't believe a word that Hardenburg says, or Haugwitz either. This famous Prussian neutrality is just a trap. I have faith only in God and the lofty destiny of our adored monarch. He will save Europe!, I think,, the prince, that if you had been sent instead of our dear Wintzingerode you would have captured the King of Prussia's consent by assault. You are so eloquent. Will you give me a cup of tea?, Wintzingerode, King of Prussia, In a moment. A propos,, I am expecting two very interesting men tonight, le Vicomte de Mortemart, who is connected with the Montmorencys through the Rohans, one of the best French families. He is one of the genuine emigres, the good ones. And also the Abbe Morio. Do you know that profound thinker? He has been received by the Emperor. 
Had you heard?, le Vicomte de Mortemart, Montmorencys, Rohans, Abbe Morio, the Emperor, I shall be delighted to meet them,, the prince, But tell me,, is it true that the Dowager Empress wants Baron Funke to be appointed first secretary at Vienna? The baron by all accounts is a poor creature., Prince Vasili, Dowager Empress Marya Fedorovna, the baron, Anna Pavlovna, the Empress, Baron Funke has been recommended to the Dowager Empress by her sister,, the Empress, Anna Pavlovna's, Her Majesty, Baron Funke, The prince, Anna Pavlovna, the Empress, Now about your family. Do you know that since your daughter came out everyone has been enraptured by her? They say she is amazingly beautiful., The prince, I often think,, I often think how unfairly sometimes the joys of life are distributed. Why has fate given you two such splendid children? I don't speak of Anatole, your youngest. I don't like him,, Anatole, Two such charming children. And really you appreciate them less than anyone, and so you don't deserve to have them., I can't help it,, the prince, Lavater would have said I lack the bump of paternity., Don't joke; I mean to have a serious talk with you. Do you know I am dissatisfied with your younger son? Between ourselves, he was mentioned at Her Majesty's and you were pitied...., The prince, What would you have me do?, You know I did all a father could for their education, and they have both turned out fools. Hippolyte is at least a quiet fool, but Anatole is an active one. That is the only difference between them., And why are children born to such men as you? If you were not a father there would be nothing I could reproach you with,, Anna Pavlovna, I am your faithful slave and to you alone I can confess that my children are the bane of my life. It is the cross I have to bear. That is how I explain it to myself. It can't be helped!, Anna Pavlovna] """
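The two calls above pass a collection of tag names and a collection of class values. A minimal sketch of the same idea on a small hand-written HTML string follows; the markup and variable names are hypothetical, and it uses lists rather than sets because lists are the form shown in the Beautiful Soup documentation.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html_doc = (
    "<h1>Title</h1><h2>Chapter</h2><p>Body</p>"
    '<span class="red">quote</span>'
    '<span class="green">name</span>'
    '<span class="blue">other</span>'
)
bsObj = BeautifulSoup(html_doc, "html.parser")

# A list of tag names matches any tag whose name is in the list.
headings = bsObj.find_all(["h1", "h2"])
# A list of attribute values matches a tag carrying any one of them.
colored = bsObj.find_all("span", {"class": ["green", "red"]})

heading_names = [tag.name for tag in headings]
colored_text = [tag.get_text() for tag in colored]
print(heading_names)  # ['h1', 'h2']
print(colored_text)   # ['quote', 'name']
```

Results come back in document order, not in the order of the values in the list.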
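  • The recursive argument is a Boolean: how many levels of the HTML document's tag structure do you want to search? If recursive is True, find_all looks through the children of the tag it is called on, and their children in turn, for tags matching your parameters. If recursive is False, find_all looks only at the direct children (for the document itself, the top-level tags). find_all searches recursively by default (recursive defaults to True);
  • The text argument is different: it matches on a tag's text content rather than on its attributes (Beautiful Soup 4.4.0 and later call this argument string; text is the older name). For example, to count how many times "the prince" appears as tag content on the page above, replace the earlier find_all call with the following code: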
nameList = bsObj.find_all(text="the prince")
print(len(nameList))
# Result
"""
7
"""
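  • The limit argument obviously applies only to find_all; find is equivalent to find_all with limit set to 1. Set it when you only care about the first x results. Note, however, that these are the first results in the order they occur on the page, which are not necessarily the ones you want.
  • There is also the keyword argument, which selects tags that have a particular attribute. For example: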
allText = bsObj.find_all(id="text")
print(allText[0].get_text())
"""
"Well, Prince, so Genoa and Lucca are now just family estates of the
Buonapartes. But I warn you, if you don't tell me that this means war,
if you still try to defend the infamies and horrors perpetrated by
that Antichrist- I really believe he is Antichrist- I will have
nothing more to do with you and you are no longer my friend, no longer
my 'faithful slave,' as you call yourself! But how do you do? I see
I have frightened you- sit down and tell me all the news."It was in July, 1805, and the speaker was the well-known Anna
Pavlovna Scherer, maid of honor and favorite of the Empress Marya
Fedorovna. With these words she greeted Prince Vasili Kuragin, a man
of high rank and importance, who was the first to arrive at her
reception. Anna Pavlovna had had a cough for some days. She was, as
she said, suffering from la grippe; grippe being then a new word in
St. Petersburg, used only by the elite.All her invitations without exception, written in French, and
delivered by a scarlet-liveried footman that morning, ran as follows:"If you have nothing better to do, Count [or Prince], and if the
prospect of spending an evening with a poor invalid is not too
terrible, I shall be very charmed to see you tonight between 7 and 10-
Annette Scherer.""Heavens! what a virulent attack!" replied the prince, not in the
least disconcerted by this reception. He had just entered, wearing
an embroidered court uniform, knee breeches, and shoes, and had
stars on his breast and a serene expression on his flat face. He spoke
in that refined French in which our grandfathers not only spoke but
thought, and with the gentle, patronizing intonation natural to a
man of importance who had grown old in society and at court. He went
up to Anna Pavlovna, kissed her hand, presenting to her his bald,
scented, and shining head, and complacently seated himself on the
sofa."First of all, dear friend, tell me how you are. Set your friend's
mind at rest," said he without altering his tone, beneath the
politeness and affected sympathy of which indifference and even
irony could be discerned."Can one be well while suffering morally? Can one be calm in times
like these if one has any feeling?" said Anna Pavlovna. "You are
staying the whole evening, I hope?""And the fete at the English ambassador's? Today is Wednesday. I
must put in an appearance there," said the prince. "My daughter is
coming for me to take me there.""I thought today's fete had been canceled. I confess all these
festivities and fireworks are becoming wearisome.""If they had known that you wished it, the entertainment would
have been put off," said the prince, who, like a wound-up clock, by
force of habit said things he did not even wish to be believed."Don't tease! Well, and what has been decided about Novosiltsev's
dispatch? You know everything.""What can one say about it?" replied the prince in a cold,
listless tone. "What has been decided? They have decided that
Buonaparte has burnt his boats, and I believe that we are ready to
burn ours."Prince Vasili always spoke languidly, like an actor repeating a
stale part. Anna Pavlovna Scherer on the contrary, despite her forty
years, overflowed with animation and impulsiveness. To be an
enthusiast had become her social vocation and, sometimes even when she
did not feel like it, she became enthusiastic in order not to
disappoint the expectations of those who knew her. The subdued smile
which, though it did not suit her faded features, always played
round her lips expressed, as in a spoiled child, a continual
consciousness of her charming defect, which she neither wished, nor
could, nor considered it necessary, to correct.In the midst of a conversation on political matters Anna Pavlovna
burst out:"Oh, don't speak to me of Austria. Perhaps I don't understand
things, but Austria never has wished, and does not wish, for war.
She is betraying us! Russia alone must save Europe. Our gracious
sovereign recognizes his high vocation and will be true to it. That is
the one thing I have faith in! Our good and wonderful sovereign has to
perform the noblest role on earth, and he is so virtuous and noble
that God will not forsake him. He will fulfill his vocation and
crush the hydra of revolution, which has become more terrible than
ever in the person of this murderer and villain! We alone must
avenge the blood of the just one.... Whom, I ask you, can we rely
on?... England with her commercial spirit will not and cannot
understand the Emperor Alexander's loftiness of soul. She has
refused to evacuate Malta. She wanted to find, and still seeks, some
secret motive in our actions. What answer did Novosiltsev get? None.
The English have not understood and cannot understand the
self-abnegation of our Emperor who wants nothing for himself, but only
desires the good of mankind. And what have they promised? Nothing! And
what little they have promised they will not perform! Prussia has
always declared that Buonaparte is invincible, and that all Europe
is powerless before him.... And I don't believe a word that Hardenburg
says, or Haugwitz either. This famous Prussian neutrality is just a
trap. I have faith only in God and the lofty destiny of our adored
monarch. He will save Europe!"She suddenly paused, smiling at her own impetuosity."I think," said the prince with a smile, "that if you had been
sent instead of our dear Wintzingerode you would have captured the
King of Prussia's consent by assault. You are so eloquent. Will you
give me a cup of tea?""In a moment. A propos," she added, becoming calm again, "I am
expecting two very interesting men tonight, le Vicomte de Mortemart,
who is connected with the Montmorencys through the Rohans, one of
the best French families. He is one of the genuine emigres, the good
ones. And also the Abbe Morio. Do you know that profound thinker? He
has been received by the Emperor. Had you heard?""I shall be delighted to meet them," said the prince. "But tell me,"
he added with studied carelessness as if it had only just occurred
to him, though the question he was about to ask was the chief motive
of his visit, "is it true that the Dowager Empress wants Baron Funke
to be appointed first secretary at Vienna? The baron by all accounts
is a poor creature."Prince Vasili wished to obtain this post for his son, but others
were trying through the Dowager Empress Marya Fedorovna to secure it
for the baron.Anna Pavlovna almost closed her eyes to indicate that neither she
nor anyone else had a right to criticize what the Empress desired or
was pleased with."Baron Funke has been recommended to the Dowager Empress by her
sister," was all she said, in a dry and mournful tone.As she named the Empress, Anna Pavlovna's face suddenly assumed an
expression of profound and sincere devotion and respect mingled with
sadness, and this occurred every time she mentioned her illustrious
patroness. She added that Her Majesty had deigned to show Baron
Funke, and again her face clouded over with sadness.The prince was silent and looked indifferent. But, with the
womanly and courtierlike quickness and tact habitual to her, Anna
Pavlovna wished both to rebuke him (for daring to speak he had done of
a man recommended to the Empress) and at the same time to console him,
so she said:"Now about your family. Do you know that since your daughter came
out everyone has been enraptured by her? They say she is amazingly
beautiful."The prince bowed to signify his respect and gratitude."I often think," she continued after a short pause, drawing nearer
to the prince and smiling amiably at him as if to show that
political and social topics were ended and the time had come for
intimate conversation- "I often think how unfairly sometimes the
joys of life are distributed. Why has fate given you two such splendid
children? I don't speak of Anatole, your youngest. I don't like
him," she added in a tone admitting of no rejoinder and raising her
eyebrows. "Two such charming children. And really you appreciate
them less than anyone, and so you don't deserve to have them."And she smiled her ecstatic smile."I can't help it," said the prince. "Lavater would have said I
lack the bump of paternity.""Don't joke; I mean to have a serious talk with you. Do you know I
am dissatisfied with your younger son? Between ourselves" (and her
face assumed its melancholy expression), "he was mentioned at Her
Majesty's and you were pitied...."The prince answered nothing, but she looked at him significantly,
awaiting a reply. He frowned."What would you have me do?" he said at last. "You know I did all
a father could for their education, and they have both turned out
fools. Hippolyte is at least a quiet fool, but Anatole is an active
one. That is the only difference between them." He said this smiling
in a way more natural and animated than usual, so that the wrinkles
round his mouth very clearly revealed something unexpectedly coarse
and unpleasant."And why are children born to such men as you? If you were not a
father there would be nothing I could reproach you with," said Anna
Pavlovna, looking up pensively."I am your faithful slave and to you alone I can confess that my
children are the bane of my life. It is the cross I have to bear. That
is how I explain it to myself. It can't be helped!"He said no more, but expressed his resignation to cruel fate by a
gesture. Anna Pavlovna meditated.
"""
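  • A caveat about keyword arguments
    Although keyword arguments are useful in some situations, they are technically a redundant feature of BeautifulSoup. Anything that can be done with a keyword argument can also be done with techniques covered later (regular expressions in 1.3 and lambda functions in 1.5).
    For example, the following two lines are equivalent:
bsObj.find_all(id="text")
bsObj.find_all("", {"id":"text"})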

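Keyword arguments also occasionally cause problems, most often when searching by the class attribute, because class is a reserved word in Python and cannot be used as a variable or argument name. If you run the following code, Python raises a syntax error because of the misused class keyword:

bsObj.find_all(class="green")

You can work around this with the solution BeautifulSoup provides, which is to append an underscore to class:

bsObj.find_all(class_="green")

Alternatively, you can pass class as a quoted key in the attributes argument: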

bsObj.find_all("", {"class":"green"})
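As a quick check that the keyword form and the attributes-dictionary form select the same tags, here is a minimal sketch on hand-written markup. The HTML string and variable names are hypothetical, and the dictionary is passed through the attrs= keyword rather than positionally.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one tag per case.
html_doc = (
    '<span class="green">name</span>'
    '<span class="red">quote</span>'
    '<div id="text">body</div>'
)
bsObj = BeautifulSoup(html_doc, "html.parser")

# class_ (with the trailing underscore) versus a quoted "class" key.
by_keyword = bsObj.find_all(class_="green")
by_attrs = bsObj.find_all(attrs={"class": "green"})

# The same comparison for the id attribute.
id_keyword = bsObj.find_all(id="text")
id_attrs = bsObj.find_all(attrs={"id": "text"})

print(by_keyword == by_attrs)  # True
print(id_keyword == id_attrs)  # True
```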
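  • Tag objects
    These are the objects, returned singly or as a list, that you get by calling find and find_all on a BeautifulSoup object or by drilling down through child tags directly, like: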
bsObj.div.h1
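A minimal sketch of this drill-down access, assuming a small hand-written document (the markup and variable names are hypothetical): attribute-style access such as bsObj.div.h1 returns the first matching Tag, just as chained find() calls do, and returns None when nothing matches.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical document for illustration.
html_doc = "<html><body><div><h1>Title</h1><p>text</p></div></body></html>"
bsObj = BeautifulSoup(html_doc, "html.parser")

heading = bsObj.div.h1                    # drill down by tag name
same = bsObj.find("div").find("h1")       # the equivalent find() chain
missing = bsObj.div.span                  # no <span> inside the <div>

print(isinstance(heading, Tag))  # True
print(heading.get_text())        # Title
print(missing)                   # None
```

Since a missing tag yields None, chaining further (for example bsObj.div.span.get_text()) raises AttributeError, so it is worth checking for None first when the page structure is not guaranteed.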

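The recursive and limit parameters described earlier in this section can be sketched on a small nested document. The markup, the id box, and the variable names are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, chosen only to make the nesting obvious.
html_doc = (
    '<div id="box">'
    '<span class="green">A</span>'
    '<p><span class="green">B</span></p>'
    '<span class="green">C</span>'
    "</div>"
)
bsObj = BeautifulSoup(html_doc, "html.parser")
box = bsObj.find("div", {"id": "box"})

# Default (recursive=True): children, grandchildren, and so on.
all_spans = box.find_all("span")
# recursive=False: only the direct children of box.
direct_spans = box.find_all("span", recursive=False)
# limit=2: stop after the first two matches, in document order.
first_two = box.find_all("span", limit=2)
# find() returns the tag that find_all(limit=1) would put first.
first = box.find("span")

print([s.get_text() for s in all_spans])     # ['A', 'B', 'C']
print([s.get_text() for s in direct_spans])  # ['A', 'C']
print([s.get_text() for s in first_two])     # ['A', 'B']
```
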
1.2.2 Getting a tag's children, siblings, and parents

  • An HTML page can be mapped to a tree.

1.2.2.1 Children and other descendants

  • The .children attribute iterates over a tag's direct children;
  • The tr tags are children of the table tag, while tr, th, td, img, and span are all descendants of the table tag. Every child is a descendant, but not every descendant is a child.
  • Scraping the children of the tag whose id is giftList:
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.pythonscraping.com/pages/page3.html')
bsObj = BeautifulSoup(html, 'html.parser')
print('-----child nodes-----')
for child in bsObj.find('table', {'id': 'giftList'}).children:
    print(child)
"""
-----child nodes-----
Item Title

Description

Cost

Image

Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00



Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52



Fish Painting

If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!

$10,005.00



Dead Parrot

This is an ex-parrot! Or maybe he's only resting?

$0.50



Mystery Box

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!

$1.50



"""
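  • The .descendants attribute iterates over all of a tag's descendants;
  • Scraping the descendants of the tag whose id is giftList returns everything nested under that tag, at every level:
print('-----descendants nodes-----')
for descendant in bsObj.find('table', {'id': 'giftList'}).descendants:
    print(descendant)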
"""
-----descendants nodes-----

Item Title

Description

Cost

Image


Item Title
Item Title
Description
Description
Cost
Cost
Image
Image
Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00




Vegetable Basket
Vegetable Basket
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!Now with super-colorful bell peppers!
Now with super-colorful bell peppers!
$15.00
$15.00


Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52




Russian Nesting Dolls
Russian Nesting Dolls
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 
8 entire dolls per set! Octuple the presents!
8 entire dolls per set! Octuple the presents!
$10,000.52
$10,000.52


Fish Painting

If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!

$10,005.00




Fish Painting
Fish Painting
If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!
If something seems fishy about this painting, it's because it's a fish! 
Also hand-painted by trained monkeys!
Also hand-painted by trained monkeys!
$10,005.00
$10,005.00


Dead Parrot

This is an ex-parrot! Or maybe he's only resting?

$0.50




Dead Parrot
Dead Parrot
This is an ex-parrot! Or maybe he's only resting?
This is an ex-parrot! 
Or maybe he's only resting?
Or maybe he's only resting?
$0.50
$0.50


Mystery Box

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!

$1.50




Mystery Box
Mystery Box
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. 
Keep your friends guessing!
Keep your friends guessing!
$1.50
$1.50


"""

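To make the difference between .children and .descendants easier to see than in the long outputs above, here is a minimal sketch on a two-row table. The markup is a hypothetical, much smaller stand-in for the giftList table, and text nodes are filtered out so that only tags are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table; the real page has more rows and columns.
html_doc = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Box</td><td><span>$1.50</span></td></tr>"
    "</table>"
)
bsObj = BeautifulSoup(html_doc, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# .children: direct children only (both iterators also yield text nodes,
# so keep just the Tag objects here).
child_tags = [c.name for c in table.children if isinstance(c, Tag)]
# .descendants: every nested tag, in document order.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)       # ['tr', 'tr']
print(descendant_tags)  # ['tr', 'th', 'th', 'tr', 'td', 'td', 'span']
```

The repeated lines in the descendants output above come from the same effect: each row is printed once as a whole, then again for each nested cell, and again for each text node inside it.
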
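1.2.2.2 Siblings

  • A tag is never its own sibling: whenever you fetch a tag's siblings, the tag itself is not included. In addition, next_siblings returns only the siblings that come after the tag. For example, if you select a tag in the middle of a group and use next_siblings on it, only the siblings after it are returned.
  • .next_siblings: returns the siblings that follow the object;
  • .previous_siblings: returns the siblings that precede the object;
  • .next_sibling and .previous_sibling work like .next_siblings and .previous_siblings, except that they return a single tag rather than a group.
  • Scraping the siblings of the first tr child of the tag whose id is giftList:
print('-----sibling nodes-----')
for sibling in bsObj.find('table', {'id': 'giftList'}).tr.next_siblings:
    print(sibling)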
"""
-----sibling nodes-----
Vegetable Basket

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
Now with super-colorful bell peppers!

$15.00



Russian Nesting Dolls

Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! 8 entire dolls per set! Octuple the presents!

$10,000.52



Fish Painting

If something seems fishy about this painting, it's because it's a fish! Also hand-painted by trained monkeys!

$10,005.00



Dead Parrot

This is an ex-parrot! Or maybe he's only resting?

$0.50



Mystery Box

If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. Keep your friends guessing!

$1.50



"""

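The four sibling attributes can be compared side by side in a minimal sketch. The table below is hypothetical (the row ids are made up) and is written without whitespace between tags, because on a real page the whitespace between rows also shows up as text-node siblings.

```python
from bs4 import BeautifulSoup

# Hypothetical rows identified by id; no whitespace between the tags.
html_doc = (
    "<table>"
    '<tr id="head"></tr><tr id="g1"></tr><tr id="g2"></tr><tr id="g3"></tr>'
    "</table>"
)
bsObj = BeautifulSoup(html_doc, "html.parser")
middle = bsObj.find("tr", {"id": "g2"})

# Siblings after the tag, in document order; the tag itself is excluded.
after = [row["id"] for row in middle.next_siblings]
# Siblings before the tag, nearest first (reverse document order).
before = [row["id"] for row in middle.previous_siblings]
# The singular forms return just the nearest neighbour.
nearest_next = middle.next_sibling["id"]
nearest_prev = middle.previous_sibling["id"]

print(after)         # ['g3']
print(before)        # ['g1', 'head']
print(nearest_next)  # g3
print(nearest_prev)  # g1
```
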
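1.2.2.3 Parents

  • .parents: iterates over all of the object's ancestors (its parent, that tag's parent, and so on);
  • .parent: returns the object's single direct parent tag;
print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())  # .text also works and returns $15.00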
"""
$15.00
"""

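The navigation in the line above (from the image up to its cell, then back to the previous cell) can be traced on a one-row table. This is a minimal sketch on hypothetical markup modelled on the gift table; only the image path is taken from the article.

```python
from bs4 import BeautifulSoup

# Hypothetical single row modelled on the gift table.
html_doc = (
    "<table><tr>"
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)
bsObj = BeautifulSoup(html_doc, "html.parser")
img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})

# .parent is the <td> holding the image; its previous sibling is the price cell.
price = img.parent.previous_sibling.get_text()
# .parents walks upward all the way to the BeautifulSoup object itself.
ancestors = [p.name for p in img.parents]

print(price)      # $15.00
print(ancestors)  # ['td', 'tr', 'table', '[document]']
```
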
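1.3 Regular expressions and BeautifulSoup

For regular expressions, see: https://editor.csdn.net/md/?articleId=115872907

  • The page to scrape is http://www.pythonscraping.com/pages/page3.html
    Notice that the page contains several product images; their source looks like this:
    <img src="../img/gifts/img1.jpg">
    If we want the URLs of all the product images, the straightforward approach is to grab every image with find_all("img"), right? There is a problem, though. Besides the obviously "extra" images (such as logos), modern websites also contain hidden images, blank images used for spacing and aligning elements, and other image tags you may not notice. In short, you cannot count on the product images being the only images on the page.
  • The page layout may also change, or for some reason we may not want to rely on an image's position in the page to find the right tag. This becomes a problem when you want to scrape an element or piece of data that appears at scattered places across a site: for example, some pages may have a product image at the very top while others do not. The solution is to look for something that identifies the tag itself; here, that is the file path of the product images.
import re
from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'html.parser')
# Raw string, so the backslashes reach the regex engine unchanged
images = bsObj.find_all('img', {'src': re.compile(r'\.\./img/gifts/img.*\.jpg')})
for image in images:
    print(image['src'])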
"""
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg
"""

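To see what the pattern accepts and rejects without fetching the page, it can be tried directly against a few src values. This is a minimal sketch with the standard re module; the candidate paths other than the two img*.jpg entries are made up, and search() is used because that is how Beautiful Soup applies a compiled pattern to an attribute value.

```python
import re

pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values: two product images, a logo, and a spacer.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/logo.jpg",
    "../img/spacer.gif",
    "../img/gifts/img6.jpg",
]

# Keep only the values the pattern finds a match in.
matches = [src for src in candidates if pattern.search(src)]
print(matches)  # ['../img/gifts/img1.jpg', '../img/gifts/img6.jpg']
```
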
1.4 Accessing attributes

 myImgTag.attrs["src"]
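A minimal sketch of attribute access on a single tag follows. The img tag and its alt value are hypothetical; attrs is an ordinary Python dictionary, so the usual dictionary operations apply.

```python
from bs4 import BeautifulSoup

# Hypothetical tag; only the src path appears elsewhere in the article.
bsObj = BeautifulSoup('<img src="../img/gifts/img1.jpg" alt="basket">', "html.parser")
myImgTag = bsObj.img

all_attrs = dict(myImgTag.attrs)   # every attribute as a dictionary
src = myImgTag.attrs["src"]        # one attribute by key
src_short = myImgTag["src"]        # shorthand for the same lookup
title = myImgTag.get("title")      # None instead of KeyError when absent

print(all_attrs)  # {'src': '../img/gifts/img1.jpg', 'alt': 'basket'}
print(title)      # None
```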

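1.5 Using lambda functions

  • A lambda expression is essentially a function that can be passed into another function as a variable; that is, instead of defining a function as f(x, y), you can define it as f(g(x), y) or even f(g(x), h(x)).
  • BeautifulSoup lets us pass certain types of functions as arguments to find_all. The only restriction is that the function must take a tag object as its argument and return a Boolean. BeautifulSoup evaluates every tag object it encounters with this function, keeps the tags for which it returns True, and discards the rest.
  • For example, the following code retrieves all tags that have exactly two attributes:
bsObj.find_all(lambda tag: len(tag.attrs) == 2)
# Result
# This line matches any tag with exactly two attributes, for example <div class="body" id="content"></div>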
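A minimal sketch of the lambda filter on hand-written markup, where the three tags and their attributes are hypothetical and chosen so that exactly two of them have two attributes:

```python
from bs4 import BeautifulSoup

# Hypothetical tags with two, one, and two attributes respectively.
html_doc = (
    '<div class="body" id="content">a</div>'
    '<span class="title">b</span>'
    '<p id="x" lang="en">c</p>'
)
bsObj = BeautifulSoup(html_doc, "html.parser")

# The function receives each tag and must return True or False.
two_attrs = bsObj.find_all(lambda tag: len(tag.attrs) == 2)
names = [tag.name for tag in two_attrs]
print(names)  # ['div', 'p']
```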

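1.6 Scraping a page file and downloading it locally

  • urllib.request.urlretrieve downloads a file to local disk given the file's URL;
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

url = "http://attack.mitre.org/"
html = urlopen(url)
bsObj = BeautifulSoup(html, 'html.parser')
imageLocation = bsObj.find("div", {"class": "py-1"}).find("img")["src"]
print(imageLocation)
urlretrieve(url + imageLocation[1:], "logo.png")  # [1:] drops the leading / so the joined URL has no double slash
"""
# imageLocation path:
/theme/images/ATT&CK_red.png
"""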

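The download URL above is built by string concatenation with a manual [1:] slice. As a possible alternative (not what the article's code does), the standard library's urllib.parse.urljoin resolves both root-relative and ../-style paths against a base URL. The sketch below only builds the URLs and downloads nothing; the variable names are illustrative.

```python
from urllib.parse import urljoin

base = "http://attack.mitre.org/"
image_location = "/theme/images/ATT&CK_red.png"  # value printed in the article

# The article's approach: drop the leading slash, then concatenate.
manual = base + image_location[1:]
# urljoin handles the leading slash on its own.
joined = urljoin(base, image_location)

# It also resolves ../ paths such as the gift images on page3.html.
relative = urljoin(
    "http://www.pythonscraping.com/pages/page3.html",
    "../img/gifts/img1.jpg",
)

print(joined)    # http://attack.mitre.org/theme/images/ATT&CK_red.png
print(relative)  # http://www.pythonscraping.com/img/gifts/img1.jpg
```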