Beautiful Soup

Relational Selection

Sibling Nodes

The previous sections covered how to get child and parent nodes. To get nodes at the same level, i.e. siblings, see the following example:

html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('prev sibling', soup.a.previous_sibling)

Output:
Next Sibling <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
prev sibling
Once upon a time there were three little sisters; and their names were

As this code shows, next_sibling and previous_sibling return the node's next and previous sibling element, respectively. The plural forms next_siblings and previous_siblings return generators over all following and all preceding siblings.
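The plural forms can be sketched as follows. This is a minimal example with made-up tag ids; html.parser is the built-in parser, used here so the sketch runs even without lxml installed:

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Bob</a><a id="link2">Lacie</a><a id="link3">Tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# next_siblings is a generator over ALL siblings after the first <a>
following = [t['id'] for t in soup.a.next_siblings]
print(following)  # ['link2', 'link3']
```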

Extracting Data

The same approach also works for extracting text, attributes, and so on:

html = """
<html>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Bob</a><a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling:')
print(type(soup.a.next_sibling))
print(soup.a.next_sibling)
print(soup.a.next_sibling.string)
print('Parent:')
print(type(soup.a.parents))
print(list(soup.a.parents)[0])
print(list(soup.a.parents)[0].attrs['class'])

Output:
Next Sibling:
<class 'bs4.element.Tag'>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
Parent:
<class 'generator'>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Bob</a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</p>
['story']

Calling string, attrs, and similar accessors retrieves text and attributes. When a call returns a generator over multiple nodes, convert it to a list first and then index into it.
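A small sketch of these accessors on made-up markup. Note that class is a multi-valued attribute, so attrs['class'] comes back as a list, which matches the ['story'] output above:

```python
from bs4 import BeautifulSoup

html = '<p class="story" id="p1"><a href="http://example.com/elsie">Elsie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.attrs)        # full attribute dict, e.g. {'class': ['story'], 'id': 'p1'}
print(soup.p.attrs['id'])  # 'p1'
print(soup.p['id'])        # shorthand indexing, same result
print(soup.a.string)       # 'Elsie'
```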

Method Selectors

All the selections above rely on attribute access, which becomes cumbersome for complex queries. Beautiful Soup therefore provides methods such as find() and find_all(): just pass in the corresponding query parameters and you can search flexibly.

find_all()

As its name implies, find_all() queries all elements matching the given criteria, somewhat like a greedy match in a regular expression.
Its API is as follows:

find_all(name,attrs,recursive,text,**kwargs)

name

We can query elements by node name:

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

Here find_all() is called with the name parameter set to ul, querying all ul nodes; the result is a list.

Since every element is a Tag, queries can be nested:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

The result is again a list, and each element is still a Tag, so we can keep drilling down:

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
Foo
Bar
Jay
[<li class="element">Foo</li>, <li class="element">Bar</li>]
Foo
Bar

This way we can get the text of each li tag.

attrs

Besides querying by node name, we can also pass in attributes:

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

Here the attrs parameter takes a dictionary. To query nodes whose id is list-1, pass the condition attrs={'id': 'list-1'}; the result is a list containing every node whose id is list-1.

We can also skip attrs and pass attributes as keyword arguments:

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

Note that class is a Python keyword, so a trailing underscore (class_) is added to distinguish it.
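The keyword-argument form, the attrs dictionary form, and a combination with a node name are all equivalent ways to match by class; a quick sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

a = soup.find_all(class_='element')            # trailing underscore dodges the keyword
b = soup.find_all(attrs={'class': 'element'})  # dict form needs no underscore
c = soup.find_all('li', class_='element')      # combined with a node name
print(len(a), len(b), len(c))  # 2 2 2
```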

text

The text parameter matches against a node's text; it accepts either a string or a regular expression:

import re
html='''
<div class="panel">
<div class="panel-body">
<a>Hello, this is a link</a>
<a>Hello, this is a link, too</a>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))

Output:
['Hello, this is a link', 'Hello, this is a link, too']

There are two a nodes here, and since find_all() is used, every text string containing "link" is returned.
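The difference between passing an exact string and a compiled regex can be sketched as follows (made-up markup; newer Beautiful Soup versions also accept this parameter under the name string):

```python
import re

from bs4 import BeautifulSoup

html = '<div><a>Hello, this is a link</a><a>Hello, this is a link, too</a></div>'
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all(text='Hello, this is a link')  # matches the full text only
regex = soup.find_all(text=re.compile('link'))       # substring match via regex
print(exact)  # ['Hello, this is a link']
print(regex)  # ['Hello, this is a link', 'Hello, this is a link, too']
```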

find()

The difference between find() and find_all() is that find() returns a single element, namely the first match.

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

The result is no longer a list but the first matching node, still of type Tag.

Besides find(), there are also:
find_parents() and find_parent(): the former returns all ancestor nodes, the latter the direct parent.
find_next_siblings() and find_next_sibling(): all following siblings vs. only the first following sibling.
find_previous_siblings() and find_previous_sibling(): all preceding siblings vs. only the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter only the first.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter only the first.
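A short sketch exercising a few of these methods on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<div class="panel"><ul id="list-1"><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

li = soup.find('li')                             # first <li>: Foo
print(li.find_next_sibling('li').string)         # Bar
print(li.find_parent('ul')['id'])                # list-1
print([p.name for p in li.find_parents('div')])  # ['div']
```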

CSS Selectors

To use CSS selectors, just call the select() method and pass in a CSS selector:

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

The syntax is the same as in CSS:
ul li queries all li nodes under ul nodes, and the result is a list of those li nodes.
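select() always returns a list; its companion select_one() returns only the first match (or None if nothing matches). A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('#list-1 .element')))  # 2
first = soup.select_one('#list-1 .element')  # first match only
print(first.string)                          # Foo
```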

Nested Queries

select() also supports nested queries:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

Here we first select the ul nodes, then iterate over each one and select its li nodes,
printing all li nodes under every ul node.

Getting Attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:
list-1
list-1
list-2
list-2

As shown, both square-bracket indexing with the attribute name and the attrs property successfully retrieve attribute values.

Getting Text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print('Get Text:', li.get_text())
    print('String:', li.string)

Output:
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar
Get Text: Jay
String: Jay
Get Text: Foo
String: Foo
Get Text: Bar
String: Bar

In this example, string and get_text() produce exactly the same output, since every li contains a single text node.
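The two accessors are not always interchangeable, though. A sketch on made-up markup showing where they diverge:

```python
from bs4 import BeautifulSoup

html = '<li>plain</li><li><b>nested</b> text</li>'
soup = BeautifulSoup(html, 'html.parser')

plain, mixed = soup.select('li')
print(plain.string)      # 'plain': exactly one text child
print(mixed.string)      # None: more than one child, so .string gives up
print(mixed.get_text())  # 'nested text': concatenates all descendant text
```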

pyquery

This library is built mainly around CSS selectors; if you already know jQuery, it is an even more convenient parsing library.

Initialization

Initializing from a String

html = '''
<div>
<ul>
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

Output:
<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

First we import the PyQuery class, aliased as pq, then pass the HTML string to it to complete initialization. After that we can pass a CSS selector: passing 'li' selects all li nodes.

Initializing from a URL

The initialization argument does not have to be a string; it can also be a web page's URL:

from pyquery import PyQuery as pq
doc = pq(url='https://diao-diaoupup.cn/')
print(doc('title'))

Output:
<title>diao-diao-UPUP</title>

In that case, the PyQuery object first requests the URL and then initializes itself with the returned HTML. It is equivalent to:
from pyquery import PyQuery as pq
import requests
doc=pq(requests.get('https://diao-diaoupup.cn/').text)
print(doc('title'))

Initializing from a File

Besides a URL, you can also pass the name of a local file via the filename parameter:

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

Summary

1. For Beautiful Soup, the lxml parser is recommended; fall back to html.parser when necessary.

2. Attribute-style node selection offers weak filtering but is very fast.

3. Prefer find() and find_all() for matching a single result or multiple results.

4. If you are familiar with CSS selectors, use the select() method.

5. pyquery's usage mirrors CSS selectors; its selector support is more powerful and convenient than Beautiful Soup's, and URL initialization requests the page automatically without needing the requests library.