XPath

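The previous section built a basic crawler that used regular expressions to extract data from pages. Regular expressions become tedious on anything complex, and a single mistake in the pattern can make the whole match fail, so here we bring in parsing libraries to do the extraction.

XPath is the XML Path Language. Although it was designed for XML, it works equally well for searching HTML documents.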

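Common rules

Expression	Description
nodename	Selects child elements of the current node named nodename
/	Selects direct children of the current node
//	Selects descendants of the current node
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes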

A common matching rule:

//title[@lang='eng']

This selects every node named title whose lang attribute has the value eng.

Example

from lxml import etree

text='''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2"><a href="link3.html">third item</a></li>
<li class="item-3"><a href="link4.html">fourth item</a>
</ul>
</div>
'''

# Parse the (deliberately incomplete) HTML into an element tree
html=etree.HTML(text)
# Serialize the repaired tree; tostring() returns bytes
result=etree.tostring(html)
print(result.decode('utf-8'))


Output:
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-2"><a href="link3.html">third item</a></li>
<li class="item-3"><a href="link4.html">fourth item</a>
</li></ul>
</div>
</body></html>

We first declare a piece of HTML text and pass it to etree.HTML(), which parses it into an element tree that XPath queries can run against.

Note that the last li tag is never closed. The etree module repairs the HTML automatically.

Calling tostring() serializes the repaired tree. The result is of type bytes, so we call decode() to turn it into a str.

In the output the li tag has been closed, and body and html nodes have been added automatically.

All nodes

XPath rules that start with // are the usual way to select every node in the document that meets a condition.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//*')
print(result)

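Here * matches any node, so every node in the HTML text is returned. The result is a list whose items are of type Element, each printed with its node name, such as html, body, div, ul, li and a.

To get only the li nodes: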

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li')
print(result)
print(result[0])

The result is an ordinary list, so a single element can be taken by index, as in result[0].
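Since test.html itself is not shown, here is a minimal self-contained sketch of the same two queries. The inline text variable is an assumed stand-in for the file, and all_tags, li_nodes and first_li are illustrative names only.

```python
from lxml import etree

# Assumed stand-in for test.html, modelled on the earlier snippet.
text = '''
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
</div>
'''

html = etree.HTML(text)

# '//*' matches every element; collect the tag names to see what came back.
all_tags = [el.tag for el in html.xpath('//*')]

# '//li' matches only li elements; the result is a plain Python list.
li_nodes = html.xpath('//li')
first_li = li_nodes[0]

print(all_tags)
print(len(li_nodes), first_li.tag, first_li.get('class'))
```

Reading el.tag instead of printing the Element objects makes the result easier to inspect, because the default repr also includes a memory address.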

Child nodes

Use / to find the direct children of an element and // to find all of its descendants.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li/a')
print(result)

This returns every a node that is a direct child of an li node.

Here / selects only direct children. To select descendants at any depth, use // instead.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li//a')
print(result)

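This returns every a node anywhere beneath an li node. For this document the output is the same.

The two are not interchangeable, though: with //li/a, if an li has no a node as a direct child, nothing is matched for it.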
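The difference between / and // only shows up when the target is nested more deeply. The sketch below assumes a small made-up document in which the second a node is wrapped in a span; the markup and the names direct and nested are illustrative.

```python
from lxml import etree

# Hypothetical markup: the second link is a grandchild of li, not a child.
text = '''
<ul>
<li><a href="link1.html">direct child</a></li>
<li><span><a href="link2.html">nested deeper</a></span></li>
</ul>
'''

html = etree.HTML(text)

# '/' only follows the direct li -> a relationship.
direct = [a.get('href') for a in html.xpath('//li/a')]

# '//' follows any depth below li.
nested = [a.get('href') for a in html.xpath('//li//a')]

print(direct)
print(nested)
```

Here //li/a finds only the first link, while //li//a finds both.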

Parent node

Finding the parent works much like moving up one level in a file-system path: use .. to step up.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//a[@href="link4.html"]/../@class')
print(result)

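Output:
['item-1']

The expression reads like a directory path: first select the a node by its href attribute, then step up with .. to its parent and read the parent's class attribute.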

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result)

Output:
['item-1']

The parent:: axis is another way to reach the parent node.
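As a self-contained check, the sketch below runs both parent forms against an assumed one-item list (the real test.html is not shown). It also uses the standard XPath ancestor:: axis, which the text above does not cover, to collect every enclosing element. The variable names are illustrative.

```python
from lxml import etree

# Assumed stand-in markup; only the link4 item is needed here.
text = '''
<ul>
<li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
'''

html = etree.HTML(text)

# Two equivalent ways to step from the a node to its parent li.
via_dots = html.xpath('//a[@href="link4.html"]/../@class')
via_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# ancestor::* selects every enclosing element, not just the parent.
ancestors = {el.tag for el in html.xpath('//a[@href="link4.html"]/ancestor::*')}

print(via_dots, via_axis)
print(ancestors)
```

The ancestors are gathered into a set because only their names matter here, not the order in which they are returned.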

Attribute matching

Use the @ symbol inside a predicate to filter nodes by attribute.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li[@class="item-0"]')
print(result)

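Here the predicate [@class="item-0"] restricts the class attribute to item-0, so every li node whose class is item-0 is returned.

Text extraction

The XPath text() function returns the text inside a node.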

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
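result=html.xpath('//li[@class="item-0"]/text()')
print(result)

This is a deliberately flawed example and is not recommended, because it may not return the text you expect. / selects only direct children, and here the text sits inside the a node rather than directly under li, so text() finds at most the whitespace between tags.

Either use // to collect text at any depth, or step into the direct child a first and then call text():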

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li[@class="item-0"]//text()')
print(result)

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li[@class="item-0"]/a/text()')
print(result)
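The three text() variants can be compared side by side on an inline document. This is a sketch under the assumption that the li contains nothing but the a node; the markup and variable names are illustrative.

```python
from lxml import etree

# Assumed markup: the only text lives inside the a node.
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
</ul>
'''

html = etree.HTML(text)

# Direct text children of li: there are none in this markup.
direct_text = html.xpath('//li[@class="item-0"]/text()')

# Text at any depth below li.
all_text = html.xpath('//li[@class="item-0"]//text()')

# Step into the a child first, then take its text.
link_text = html.xpath('//li[@class="item-0"]/a/text()')

print(direct_text, all_text, link_text)
```

If the li also contained line breaks or indentation around the a node, the first query would return those whitespace strings instead of an empty list.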

Attribute extraction

text() returns the text inside a node. To read a node's attributes, use the @ symbol.

from lxml import etree

html=etree.parse('./test.html',etree.HTMLParser())
result=html.xpath('//li/a/@href')
print(result)

This returns the href attribute of every a node that is a direct child of an li node.
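Attribute matching and attribute extraction both use @, but they return different things. The following sketch, on assumed inline markup with illustrative variable names, shows a predicate returning elements and a trailing /@href returning strings.

```python
from lxml import etree

# Assumed stand-in for test.html.
text = '''
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>
'''

html = etree.HTML(text)

# @ inside [...] filters: the result is a list of Element objects.
matched = html.xpath('//li[@class="item-0"]')

# @ at the end of the path extracts: the result is a list of strings.
hrefs = html.xpath('//li/a/@href')

print([el.tag for el in matched])
print(hrefs)
```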

Matching an attribute with multiple values

Sometimes an attribute holds more than one value:

from lxml import etree

text='''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html=etree.HTML(text)
result=html.xpath('//li[@class="li"]/a/text()')
print(result)

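In this example the class attribute of li holds two values, li and li-first, so the exact comparison used so far no longer matches and the query returns an empty list. This is where the contains() function comes in: its first argument is the attribute and its second is the value to look for.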

from lxml import etree

text='''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html=etree.HTML(text)
result=html.xpath('//li[contains(@class,"li")]/a/text()')
print(result)

Output:
['first item']

Now the text is extracted.
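One caveat: contains() is a plain substring test, so contains(@class, "li") would also match a class such as list-item. A common XPath idiom, shown here as a sketch rather than something the original text uses, pads the attribute with spaces and looks for the padded token. The markup and the names loose and strict are made up for illustration.

```python
from lxml import etree

# Hypothetical markup: the second class merely contains the letters "li".
text = '''
<ul>
<li class="li li-first"><a href="a.html">first item</a></li>
<li class="list-item"><a href="b.html">other item</a></li>
</ul>
'''

html = etree.HTML(text)

# Substring match: both li nodes qualify.
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token match: pad the class value with spaces and search for " li ".
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```

concat() and normalize-space() are standard XPath 1.0 functions, so the stricter form works in lxml without extensions.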

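Matching on multiple attributes

Another case is when a node must be identified by several attributes at once. Join the conditions with the and operator: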

from lxml import etree

text='''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html=etree.HTML(text)
result=html.xpath('//li[contains(@class,"li") and @name="item"]/a/text()')
print(result)

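Here li has gained a second attribute, name. To pin down the node we select on both class and name: one condition is that class contains the string li, the other that name equals item.

Besides and, XPath has many other operators:

Operator	Description	Example	Result
or	logical or	age=19 or age=20	true if age is 19; false if age is 21
and	logical and	age>19 and age<21	true if age is 20
mod	remainder of division	5 mod 2	1
+	addition	6+4	10
-	subtraction	6-4	2
*	multiplication	6*4	24
div	division	8 div 4	2
=	equal	age=19	true if age is 19

The comparison operators (<, >, <=, >=, !=) behave as in other languages and need no further introduction.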
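The operators in the table can be tried directly, because xpath() also evaluates expressions that produce numbers or booleans. The sketch below assumes a made-up person element with an age attribute. Note that lxml returns XPath numbers as Python floats.

```python
from lxml import etree

# Hypothetical one-element document; age is read as an attribute.
root = etree.XML('<person age="20"/>')

remainder = root.xpath('5 mod 2')      # arithmetic yields a float
quotient = root.xpath('8 div 4')
total = root.xpath('6 + 4')

in_range = root.xpath('@age > 19 and @age < 21')   # boolean result
either = root.xpath('@age = 19 or @age = 20')
not_21 = root.xpath('@age != 21')

print(remainder, quotient, total)
print(in_range, either, not_21)
```

In practice these operators appear inside predicates, as in the and example above, rather than as standalone expressions.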

Beautiful Soup

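Beautiful Soup is also a parsing library, and for many tasks it is more convenient than writing XPath queries by hand.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it.

Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; lenient with malformed documents	Requires the lxml C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only parser that supports XML	Requires the lxml C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses the document the way a browser does; produces valid HTML5	Slow; requires an external Python package

Basic usage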

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

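prettify() outputs the parsed document with standard indentation. Note that the output contains the closing body and html tags that the input omitted, which shows that Beautiful Soup corrects malformed markup automatically.

soup.title selects the title node in the HTML, and its string attribute returns the text inside it.

Node selectors

Accessing a node by name as an attribute of the soup selects that element, and its string attribute then returns the text inside it.

Selecting elements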

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.title.string)
print(soup.head)
print(soup.p)

Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

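Printing soup.title shows the title node together with the text inside it. Its type is bs4.element.Tag, an important data structure in Beautiful Soup: everything selected this way is a Tag, and a Tag has attributes such as string.

Note also that only one p node was printed: this style of selection returns only the first matching node.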
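When every matching node is needed rather than just the first, Beautiful Soup's find_all() method can be used. It is not covered in this section, so treat the following as a brief sketch. It uses the built-in html.parser so that it runs without lxml, and the shortened markup is an assumption.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document, for illustration only.
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time...</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'html.parser')

# Attribute-style access stops at the first match.
first_p = soup.p

# find_all() returns every match in document order.
all_p = soup.find_all('p')

print(first_p['class'])
print(len(all_p), [p['class'] for p in all_p])
```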

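Extracting information

The string attribute gives the text of a node. Other information about a node can be read in several ways:

Getting the name

The name attribute holds the name of a node: select the title node and read its name attribute.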

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

Output:
title

Getting attributes

A node may carry several attributes, such as id and class. After selecting the node, read attrs to get all of them:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs)
print(soup.p.attrs['name'])

Output:
{'class': ['title'], 'name': 'dromouse'}
dromouse

As shown, attrs returns a dictionary.

There is also a shorter form that skips attrs and indexes the node directly:

print(soup.p['name'])
print(soup.p['class'])

Output:
dromouse
['title']

Because an element can have several classes, class is returned as a list.
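Indexing a Tag with a missing attribute raises KeyError, just as a dictionary would. The sketch below, using a made-up one-line document and the built-in html.parser, shows the list-valued class next to the safer get() lookup.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two classes and no id attribute.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', 'html.parser')
p = soup.p

classes = p['class']          # multi-valued attribute: a list
name = p['name']              # single-valued attribute: a string
missing = p.get('id')         # get() returns None when the attribute is absent

try:
    p['id']
    raised = False
except KeyError:
    raised = True             # direct indexing fails for a missing attribute

print(classes, name, missing, raised)
```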

Nested selection

To get the title element inside the head element, use nested selection: simply select an element on an already selected element.

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

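Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

This is nested node selection.

Related selection

Sometimes the target node cannot be selected in one step. In that case select some node first, then use it as a base to reach its children, parent, siblings and so on.

Children and descendants

After selecting an element, read its contents attribute to get its direct children: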

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
print(soup.p.contents)

Output:
['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, '\n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

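The result is a list. The p node contains both text and nodes, and they are all returned together in one list. Note that every item in the list is a direct child of p: contents returns the list of direct children only, so the span inside the first a node does not appear as an item of its own.

The children attribute gives the same result: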

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x000002711FDF8520>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4 
            and
            
5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.

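The HTML is the same, but this time the children attribute is used. It returns an iterator rather than a list, so we loop over it to print each child.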
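The relationship between contents and children can be checked on a compact document. This is a sketch with assumed one-line markup and the built-in html.parser. The isinstance filter keeps only Tag items and drops the text strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact, made-up markup: p has two a children; span is a grandchild.
html = '<p>intro <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# contents is a real list: two strings and two a tags.
contents = soup.p.contents

# children iterates over the same items; keep only the tags.
child_tags = [child.name for child in soup.p.children if isinstance(child, Tag)]

print(type(contents).__name__, len(contents))
print(child_tags)
```

The span does not show up in child_tags, because it is a child of the first a node, not of p.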

To get all descendants, use the descendants attribute:


from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
    
Output:
<generator object Tag.descendants at 0x000002005FC8AC10>
0 
            Once upon a time there were three little sisters; and their names were
            
1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9 
            and
            
10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

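This time the output includes the span node: descendants yields not only the direct children but every descendant, including the text inside each nested node.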
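Separating tags from text makes the traversal order of descendants easier to see. The sketch below assumes compact made-up markup and the built-in html.parser. The names desc_tags and desc_strings are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Compact, made-up markup with one nested span.
html = '<p>intro <a id="link1"><span>Elsie</span></a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# Every tag below p, in document order, including the nested span.
desc_tags = [d.name for d in soup.p.descendants if isinstance(d, Tag)]

# Every text node below p, including text inside nested tags.
desc_strings = [str(d) for d in soup.p.descendants if isinstance(d, NavigableString)]

print(desc_tags)
print(desc_strings)
```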

Parent and ancestors

To get the parent of an element, use the parent attribute:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output:
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>

We selected the parent of the first a node, which is the p node, so the output is the p node with all of its content.

To get all ancestors, use the parents attribute:

html = """
<html>
    <body>
        <p class="story">
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(type(soup.a.parents))
print(list(enumerate(soup.a.parents)))

Output:
<class 'generator'>
[(0, <p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>), (1, <body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body>), (2, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>), (3, <html>
<body>
<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
</body></html>)]

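parents returns a generator, so we wrap it in enumerate() and list() to print it. The first item is the p node, the parent of a; next comes body, the parent of p; then html, the parent of body. The final item is the BeautifulSoup object itself, which represents the whole document and therefore prints the same as html.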
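Printing whole nodes makes the ancestor chain hard to read. A shorter way to see it is to list only the names. The sketch below uses well-formed markup and the built-in html.parser as assumptions, and names is an illustrative variable.

```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1"><span>Elsie</span></a>
</p>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Walk outward from the a tag, recording only each ancestor's name.
names = [ancestor.name for ancestor in soup.a.parents]

print(names)
```

The last name, [document], is how the BeautifulSoup object identifies itself, which explains the fourth entry in the output above.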

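Summary

1. In XPath, / selects direct children and // selects all descendants.

2. / works for querying child nodes, but use it with care, because nothing is returned when the target is not a direct child.

3. When using text(), prefer //, which collects text from all descendants and avoids an empty result when the text is not a direct child.

4. Attribute matching uses square brackets with an attribute name and value to filter nodes, as in [@href="link1.html"], whereas a trailing /@href extracts the value of an attribute. Keep the two apart.

5. Beautiful Soup and XPath are used in similar ways, and Beautiful Soup is often the more convenient and readable of the two: where XPath uses // for descendants, Beautiful Soup uses the descendants attribute.

6. The children attribute returns direct children only, like / in XPath.

7. In XPath the parent is reached with .. or parent::, and ancestors with the ancestor:: axis. In Beautiful Soup the parent and parents attributes return the parent and all ancestors.

8. parent and parents in Beautiful Soup work from the inside out: each step prints a larger enclosing node with all of its content.

9. When a node contains child nodes, the text inside those children is itself a node and is included when the parent is output.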