Python Web Crawling and Information Extraction
Text-editor-style IDEs
- IDLE — built in, the default, and widely used (well suited to Python beginners)
- Sublime Text — a third-party editor designed for programming
Integrated IDEs
- PyCharm
- Anaconda & Spyder
Getting Started with the Requests Library
Installing the Requests library
Run the following command in a terminal:

```
$ pip install requests
```

Getting the source code:

```
$ git clone git://github.com/kennethreitz/requests.git
```

Alternatively, download the tarball:

```
$ curl -OL https://github.com/requests/requests/tarball/master
```
Once that is done, install it from the source directory:

```
$ cd requests
$ pip install .    # install from the local source checkout
```
Basic usage:

```python
import requests
```
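A minimal sketch of basic usage (the target URL is only an illustration):

```python
import requests

# Fetch a page and confirm the request succeeded.
r = requests.get('https://www.baidu.com')
print(r.status_code, len(r.text))
```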
The seven main methods of the Requests library
The get() method
Usage:

```python
requests.get(url, params=None, **kwargs)
```

- url: the URL of the page to fetch
- params: extra query parameters appended to the URL, as a dict or byte sequence; optional
- **kwargs: 12 optional keyword arguments that control the request
Attributes of the Response object
Example:

```python
import requests
```

r.encoding: if the charset field is absent from the HTTP header, the encoding is assumed to be ISO-8859-1
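A sketch exercising the commonly used Response attributes (the URL is only an illustration):

```python
import requests

r = requests.get('https://www.baidu.com')
print(r.status_code)          # HTTP status code of the response, 200 on success
print(r.encoding)             # encoding guessed from the HTTP headers
print(r.apparent_encoding)    # encoding guessed from the content itself
print(r.text[:200])           # response body decoded as text
print(type(r.content))        # response body as raw bytes
```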
Exceptions raised by the Requests library
The status_code response code
```python
r = requests.get('https://www.baidu.com')
r.status_code
200
# For convenience, Requests also ships a built-in status-code lookup object
r.status_code == requests.codes.ok
True
# For a failed request, Response.raise_for_status() can be used to raise an exception
bad_r = requests.get('http://httpbin.org/status/404')
bad_r.status_code
404
bad_r.raise_for_status()
Traceback (most recent call last):
  File "requests/models.py", line 832, in raise_for_status
    raise http_error
requests.exceptions.HTTPError: 404 Client Error
# When r.status_code is 200, calling raise_for_status() simply returns
r.raise_for_status()
None
```

A general code framework for fetching web pages
```python
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise HTTPError if the status is not 200
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "an exception occurred"

if __name__ == "__main__":
    url = 'https://www.baidu.com'
    print(getHTMLText(url))
```
The HTTP protocol
HTTP stands for Hypertext Transfer Protocol.
HTTP is a stateless application-layer protocol based on the "request and response" model.
HTTP uses URLs to identify and locate network resources.
URL format
http://host[:port][path]
- host: a valid Internet host domain name or IP address
- port: the port number; the default port is 80
- path: the path of the requested resource
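As a small illustration of these URL components (not part of the original notes), Python's standard urllib.parse can split a URL into them:

```python
from urllib.parse import urlparse

# The URL here is purely illustrative.
u = urlparse('http://www.example.com:8080/path/index.html')
print(u.hostname)   # www.example.com
print(u.port)       # 8080
print(u.path)       # /path/index.html
```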
HTTP operations on resources
Understanding the difference between PATCH and PUT (a sketch follows the list below)
- With PATCH, only a partial update is submitted to the URL (just the changed fields)
- With PUT, all fields must be submitted together; any field not submitted is deleted
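A brief sketch of the difference, using the httpbin.org echo service (an assumption for illustration; it simply reflects back what it receives):

```python
import requests

# PATCH: submit only the field being changed.
r = requests.patch('http://httpbin.org/patch', data={'city': 'Beijing'})
print(r.json()['form'])    # only the updated field appears

# PUT: every field must be resubmitted, or the missing ones are considered deleted.
r = requests.put('http://httpbin.org/put', data={'name': 'Alice', 'city': 'Beijing'})
print(r.json()['form'])    # the full record
```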
The head() method

```python
r = requests.head('http://www.baidu.com')
r.headers
{'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Connection': 'Keep-Alive', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html', 'Date': 'Sun, 01 Jul 2018 07:38:20 GMT', 'Last-Modified': 'Mon, 13 Jun 2016 02:50:26 GMT', 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18'}
r.text
''
```

The post() method
```python
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post('http://www.baidu.com/post', data=payload)
print(r.text)
{                     # POSTing a dict to a URL encodes it automatically as a form
  ...
  "form": {
    "key2": "value2",
    "key1": "value1"
  },
  ...
}
```

The put() method
```python
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.put('http://www.baidu.com/put', data=payload)
print(r.text)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
 <head>
  <title>405 Method Not Allowed</title>
 </head>
 <body>
  <h1>Method Not Allowed</h1>
  <p>The requested method PUT is not allowed for the URL /put.</p>
 </body>
</html>
```
A closer look at the main methods of the Requests library
requests.request(method, url, **kwargs)
method: the request method, corresponding to the seven HTTP verbs (GET/HEAD/POST/PUT/PATCH/DELETE/OPTIONS)
- r = requests.request('GET', url, **kwargs)
- r = requests.request('HEAD', url, **kwargs)
- r = requests.request('POST', url, **kwargs)
- r = requests.request('PUT', url, **kwargs)
- r = requests.request('PATCH', url, **kwargs)
- r = requests.request('DELETE', url, **kwargs)
- r = requests.request('OPTIONS', url, **kwargs)
url: the URL of the page to fetch
**kwargs: 13 optional keyword arguments that control the request
params: a dict or byte sequence, appended to the URL as query parameters
```python
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('GET', 'http://www.baidu.com/s', params=kv)
print(r.url)
http://www.baidu.com/s?key1=value1&key2=value2
```

data: a dict, byte sequence, or file object, used as the body of the Request when submitting a resource to the server
```python
kv = {'key1': 'value1', 'key2': 'value2'}
r = requests.request('POST', 'http://www.baidu.com/s', data=kv)
body = 'main body content'
r = requests.request('POST', 'http://www.baidu.com/s', data=body)
```

json: data in JSON format, used as the body of the Request
```python
kv = {'key1': 'value1'}
r = requests.request('POST', 'http://www.baidu.com/s', json=kv)
```

headers: a dict of custom HTTP headers
```python
hd = {'user-agent': 'Chrome/10'}
r = requests.request('POST', 'http://python123.io/ws', headers=hd)
```

cookies: a dict or CookieJar, the cookies to send with the Request
auth: a tuple, for HTTP authentication
files: a dict, for uploading files
```python
fs = {'file': open('data.xls', 'rb')}
r = requests.request('POST', 'http://python123.io/ws', files=fs)
```

timeout: the timeout, in seconds

```python
r = requests.request('GET', 'http://www.baidu.com', timeout=10)
```
proxies: a dict of proxy servers to route requests through; login credentials can be included

```python
pxs = {'http': 'http://user:pass@10.10.10.1:1234',
       'https': 'https://10.10.10.1:4321'}
r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)
```

allow_redirects: True/False, default True; whether to follow redirects
stream: True/False, default False; if False the response body is downloaded immediately, if True it is streamed on demand
verify: True/False, default True; whether to verify the SSL certificate
cert: path to a local SSL client certificate
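The parameters that have no example above can be sketched together as follows (all values are illustrative, and httpbin.org is only a convenient echo service):

```python
import requests

r = requests.request('GET', 'http://httpbin.org/cookies',
                     cookies={'session': 'abc123'},   # cookies: dict or CookieJar
                     auth=('user', 'passwd'),         # auth: (username, password) tuple
                     allow_redirects=True,            # follow redirects (the default)
                     stream=False,                    # download the body immediately (the default)
                     verify=True,                     # verify the SSL certificate (the default)
                     timeout=10)
print(r.status_code)
```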
The ethics of web crawling
- The scale of web crawlers
Legal risks of web crawling
- Data stored on servers has ownership rights
- Profiting from crawled data carries legal risk
Web crawlers can also leak private data
How websites restrict crawlers
- Source checking: restricting by User-Agent
- Inspect the User-Agent field of incoming HTTP request headers and respond only to browsers or known friendly crawlers
- Announcement: the Robots protocol
- Tell all crawlers the site's crawling policy and ask them to comply
The Robots protocol (Robots Exclusion Standard)
Purpose: a website tells crawlers which pages may be fetched and which may not
Form: a robots.txt file in the root directory of the site
- Example robots.txt:
```
User-agent: *
Disallow: /?*
Disallow: /pop/*.html
Disallow: /pinpai/*.html?*
User-agent: EtaoSpider
Disallow: /
User-agent: HuihuiSpider
Disallow: /
User-agent: GwdangSpider
Disallow: /
User-agent: WochachaSpider
Disallow: /
```

Notation: `*` matches any crawler and `/` stands for the site root, so the strictest possible policy is:

```
User-agent: *
Disallow: /
```

How to comply
- Crawlers should identify the site's robots.txt, automatically or manually, before crawling its content
- Binding force: the Robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk
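As a sketch of the "automatic identification" step, Python's standard urllib.robotparser can fetch and query a robots.txt (the site URL below is hypothetical):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')   # hypothetical site
rp.read()

# Ask whether a given user agent may fetch a given URL before crawling it.
print(rp.can_fetch('MyCrawler', 'https://www.example.com/some/page.html'))
```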
Understanding
Crawling examples with the Requests library
Example 1: Fetching a JD.com product page

```python
import requests
```
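A sketch of how this kind of product-page fetch is typically written (the product URL is an illustrative assumption):

```python
import requests

url = 'https://item.jd.com/2967929.html'   # illustrative product page
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])                   # print only the first 1000 characters
except requests.RequestException:
    print('fetch failed')
```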
Example 2: Fetching an Amazon product page

```python
import requests
```
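The key point of this example is that Amazon rejects the default python-requests User-Agent, so a browser-like header is supplied. A sketch (the URL and header value are assumptions):

```python
import requests

url = 'https://www.amazon.cn/gp/product/B01M8L5Z3Y'   # illustrative product page
kv = {'user-agent': 'Mozilla/5.0'}                    # pretend to be a browser
try:
    r = requests.get(url, headers=kv, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[1000:2000])
except requests.RequestException:
    print('fetch failed')
```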
Example 3: Submitting a search keyword to Baidu and 360 (a combined sketch follows this list)
- Baidu

```python
import requests
```

- 360

```python
import requests
```
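Both engines take the keyword as a query parameter: Baidu uses wd and 360 (so.com) uses q. A combined illustrative sketch:

```python
import requests

keyword = 'Python'
try:
    # Baidu: keyword goes in the 'wd' parameter
    r = requests.get('http://www.baidu.com/s', params={'wd': keyword}, timeout=30)
    r.raise_for_status()
    print(r.request.url, len(r.text))

    # 360 search: keyword goes in the 'q' parameter
    r = requests.get('http://www.so.com/s', params={'q': keyword}, timeout=30)
    r.raise_for_status()
    print(r.request.url, len(r.text))
except requests.RequestException:
    print('fetch failed')
```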
Example 4: Fetching a web image and saving it

```python
import requests
```
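The essence of this example is to request the image and write r.content (the binary body) to a file. A sketch with an assumed image URL and directory:

```python
import requests
import os

url = 'https://www.example.com/some_picture.jpg'   # illustrative image URL
root = './pics/'
path = root + url.split('/')[-1]
try:
    os.makedirs(root, exist_ok=True)
    if not os.path.exists(path):
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)                     # r.content is the binary response body
        print('saved to', path)
    else:
        print('file already exists')
except requests.RequestException:
    print('fetch failed')
```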
Example 5: Automatic lookup of an IP address's location

```python
import requests
```
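The idea is to append the IP address to an IP-lookup service's query URL. A sketch (the service URL is an assumption, and such services change often):

```python
import requests

url = 'http://m.ip138.com/ip.asp?ip='   # assumed lookup service
ip = '202.204.80.112'
try:
    r = requests.get(url + ip, timeout=30)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-500:])                # the result usually appears near the end of the page
except requests.RequestException:
    print('fetch failed')
```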
Getting Started with BeautifulSoup
Installing the BeautifulSoup library
Installing with pip

```
$ pip3 install BeautifulSoup4
```

Installing from source
Install via setup.py:

```
$ python setup.py install
```

Installing a parser
lxml

```
$ pip3 install lxml
```

html5lib

```
$ pip3 install html5lib
```

The main parsers and how to use them
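As a minimal sketch, the parser is selected by name in the BeautifulSoup constructor (the extra parsers must be installed as shown above):

```python
from bs4 import BeautifulSoup

markup = '<html><body><p>data</p></body></html>'
soup1 = BeautifulSoup(markup, 'html.parser')   # Python's built-in HTML parser
soup2 = BeautifulSoup(markup, 'lxml')          # lxml HTML parser (requires lxml)
soup3 = BeautifulSoup(markup, 'xml')           # lxml XML parser (requires lxml)
soup4 = BeautifulSoup(markup, 'html5lib')      # html5lib parser (requires html5lib)
```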
A quick test
```python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://python123.io/ws/demo.html')
demo = r.text
soup = BeautifulSoup(demo, 'html.parser')   # the HTML content to parse, and the parser to use
print(soup.prettify())
<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>
```
Basic elements of the BeautifulSoup library
BeautifulSoup is a library for parsing, traversing, and maintaining a "tag tree"
Basic elements of the BeautifulSoup class
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

soup.title
<title>This is a python demo page</title>
tag = soup.a
tag
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
soup.a.name
'a'
soup.a.parent.name
'p'
soup.a.parent.parent.name
'body'
tag.attrs
{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}
tag.attrs['class']
['py1']
tag.attrs['href']
'http://www.icourse163.org/course/BIT-268001'
type(tag.attrs)
<class 'dict'>
type(tag)
<class 'bs4.element.Tag'>
soup.a.string
'Basic Python'
soup.p.string
'The demo python introduces several python courses.'
type(soup.p.string)
<class 'bs4.element.NavigableString'>
newsoup = BeautifulSoup("<b><!-- This is a comment --></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string          # comments come back as Comment objects (rarely needed)
' This is a comment '
type(newsoup.b.string)
<class 'bs4.element.Comment'>
newsoup.p.string
'This is not a comment'
type(newsoup.p.string)
<class 'bs4.element.NavigableString'>
```
Traversing HTML content with the bs4 library
Ways to traverse the tag tree
Downward traversal of the tag tree
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")

soup.head
<head><title>This is a python demo page</title></head>
soup.head.contents
[<title>This is a python demo page</title>]
soup.body.contents
['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
len(soup.body.contents)
5
soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>

# Iterate over the direct children
for child in soup.body.children:
    print(child)

# Iterate over all descendants
for child in soup.body.descendants:
    print(child)
```

Upward traversal of the tag tree
```python
soup = BeautifulSoup(demo, "html.parser")

soup.title.parent
<head><title>This is a python demo page</title></head>
soup.html.parent
<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>
soup.parent                       # the soup object itself has no parent

# Upward traversal of the tag tree
soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
```

Sideways (sibling) traversal of the tag tree
```python
soup = BeautifulSoup(demo, "html.parser")

soup.a.next_sibling
' and '
soup.a.next_sibling.next_sibling
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
soup.a.previous_sibling
'Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n'
soup.a.previous_sibling.previous_sibling    # returns None: there is no earlier sibling

# Sideways traversal of the tag tree
# Iterate over the siblings that follow
for sibling in soup.a.next_siblings:
    print(sibling)
# Iterate over the siblings that precede
for sibling in soup.a.previous_siblings:
    print(sibling)
```

Summary
HTML-formatted output with the bs4 library
The prettify() method formats HTML text for readable output

```python
soup = BeautifulSoup(demo, "html.parser")
soup.prettify()
print(soup.prettify())
```
The three forms of information markup
The three forms
- XML – eXtensible Markup Language
- JSON – JavaScript Object Notation
- YAML – YAML Ain't Markup Language
Comparison of the three markup forms
General approaches to information extraction
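A sketch of the combined approach (using bs4's find_all() search method, covered in the next section): locate the tags of interest first, then read out the attribute or text needed, for example all link URLs of the demo page:

```python
import requests
from bs4 import BeautifulSoup

demo = requests.get('https://python123.io/ws/demo.html').text
soup = BeautifulSoup(demo, 'html.parser')
for link in soup.find_all('a'):       # find the tags, then read the attribute of interest
    print(link.get('href'))
```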
Searching HTML content with the bs4 library
<>.find_all(name, attrs, recursive, string, **kwargs) returns a list of the matching results
name: a string to match against tag names
```python
soup.find_all('a')
[<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
soup.find_all(['a', 'b'])
[<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
for tag in soup.find_all(True):        # True matches every tag
    print(tag.name)
html
head
title
body
p
b
p
a
a
import re
for tag in soup.find_all(re.compile('b')):   # a regex matches tag names containing 'b'
    print(tag.name)
body
b
```

attrs: a string to match against tag attribute values; attribute-based search can also be specified
recursive: whether to search all descendants; default True
string: a string to match against the text between <>…</>
<tag>(…) is equivalent to <tag>.find_all(…), and soup(…) is equivalent to soup.find_all(…)
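A brief illustrative sketch of the remaining parameters, using the same demo page (the results noted in the comments assume that page's content):

```python
import re
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://python123.io/ws/demo.html').text, 'html.parser')
print(soup.find_all('p', 'course'))                 # attrs: <p> tags whose class includes 'course'
print(soup.find_all(id='link1'))                    # keyword search on an attribute value
print(soup.find_all('a', recursive=False))          # only direct children of the root: []
print(soup.find_all(string='Basic Python'))         # exact match on the string content
print(soup.find_all(string=re.compile('python')))   # strings containing 'python'
```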
Seven commonly used extension methods
Example 1: A focused crawler for Chinese university rankings
Functional description
- Input: the URL of the university-ranking page
- Output: the ranking information printed to the screen (rank, university name, total score)
- Technical route: requests + bs4
- Focused crawling: only the given URL is crawled; links are not followed
Program structure
- Fetch the ranking page from the web: getHTMLText()
- Extract the information from the page into a suitable data structure: fillUnivList()
- Use that data structure to display and print the results: printUnivList()
Code
```python
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return "except"

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, 'html.parser')
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校", "分数", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

if __name__ == '__main__':
    uinfo = []
    url = 'http://zuihaodaxue.com/zuihaodaxuepaiming2018.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 100)
```
Regular expressions
A regular expression is a concise way of describing a set of strings.
Compiling: converting a string that follows regular-expression syntax into a pattern object
Syntax
Basic use of the re library
re is part of Python's standard library and is used mainly for string matching; load it with import re
How regular expressions are written
- as a raw string (e.g. r'[1-9]\d{5}'), in which backslashes are not treated as escape characters
- as an ordinary string, which requires the backslashes themselves to be escaped
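A small illustration of why raw strings are preferred (not part of the original notes): without the r prefix, every backslash has to be written twice.

```python
import re

p1 = r'[1-9]\d{5}'      # raw string: backslash written once
p2 = '[1-9]\\d{5}'      # ordinary string: backslash must be doubled
print(re.search(p1, 'BIT 100081').group(0))   # '100081'
print(re.search(p2, 'BIT 100081').group(0))   # '100081'; the two patterns are equivalent
```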
The main functions of the re library
re.search(pattern, string, flags=0)
pattern: the regular expression, as a string or raw string
string: the string to match against
flags: flags that control matching behavior
```python
import re
match = re.search(r'[1-9]\d{5}', 'bit 100081')
if match:
    print(match.group(0))
100081
```
re.match(pattern, string, flags=0)
pattern: the regular expression, as a string or raw string
string: the string to match against
flags: flags that control matching behavior

```python
match = re.match(r'[1-9]\d{5}', '100081 bit')
```
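For contrast (an added illustration), re.match() only succeeds when the match starts at the very beginning of the string, unlike re.search():

```python
import re

print(re.match(r'[1-9]\d{5}', '100081 bit').group(0))   # '100081'
print(re.match(r'[1-9]\d{5}', 'bit 100081'))            # None: the string does not start with a match
```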
re.findall(pattern, string, flags=0)
pattern: the regular expression, as a string or raw string
string: the string to match against
flags: flags that control matching behavior

```python
ls = re.findall(r'[1-9]\d{5}', 'bit100081 tsu 100084')
ls
['100081', '100084']
```
re.split(pattern, string, maxsplit=0, flags=0)
pattern: the regular expression, as a string or raw string
string: the string to match against
maxsplit: the maximum number of splits; whatever remains is returned as the final element
flags: flags that control matching behavior

```python
ls = re.split(r'[1-9]\d{5}', 'bit100081 tsu100084', maxsplit=1)
ls
['bit', ' tsu100084']
```
re.finditer(pattern, string, flags=0)
pattern: the regular expression, as a string or raw string
string: the string to match against
flags: flags that control matching behavior

```python
for m in re.finditer(r'[1-9]\d{5}', 'bit100081 tsu100084'):
    if m:
        print(m.group(0))
100081
100084
```
re.sub(pattern, repl, string, count=0, flags=0)
pattern: the regular expression, as a string or raw string
repl: the string that replaces each match
string: the string to match against
count: the maximum number of replacements
flags: flags that control matching behavior

```python
rst = re.sub(r'[123]', '456', '1237878712398798123', count=2)   # store the result in rst, not in re
rst
'45645637878712398798123'
```
Two equivalent ways of using the re library
Functional style: one-off operations

```python
rst = re.search(r'[1-9]\d{5}', 'bit 100081')
```

Object-oriented style: compile once, then use repeatedly

```python
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('bit 100081')
```

regex = re.compile(pattern, flags=0) compiles the string form of a regular expression into a regex (pattern) object
- regex.search()
- regex.match()
- regex.findall()
- regex.split()
- regex.finditer()
- regex.sub()
The Match object of the re library
Attributes of the Match object
Methods of the Match object
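A short illustrative sketch of the most commonly used Match attributes and methods (the pattern and string are arbitrary examples):

```python
import re

m = re.search(r'[1-9]\d{5}', 'bit100081 tsu100084')
print(m.string)                       # the string that was searched
print(m.re)                           # the compiled pattern that was used
print(m.pos, m.endpos)                # the range of the string that was searched
print(m.group(0))                     # the matched substring
print(m.start(), m.end(), m.span())   # where the match begins and ends
```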
Greedy and minimal matching in the re library
Greedy matching: by default the re library is greedy, returning the longest substring that matches
Minimal matching: append ? to a quantifier (e.g. *?, +?) to return the shortest match instead
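A quick illustrative comparison (not from the original notes):

```python
import re

match = re.search(r'PY.*N', 'PYANBNCNDN')
print(match.group(0))    # 'PYANBNCNDN': greedy, the longest possible match
match = re.search(r'PY.*?N', 'PYANBNCNDN')
print(match.group(0))    # 'PYAN': minimal, the shortest possible match
```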
Example 2: A focused crawler that compares product prices on Taobao
Functional description
- Goal: fetch Taobao search result pages and extract the product names and prices
- Key point: understanding Taobao's search interface and how its paging works
- Technical route: requests + re
Program structure
- Submit the product search request and fetch the result pages in a loop
- For each product, extract its name and price
- Print the information to the screen
Implementation
```python
import requests
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("xuhao", "jiage", "shangpin name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = 'shubao'
    depth = 2
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)   # each result page holds 44 items; s is the offset
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()
```
Example 3: A focused crawler for stock data
Functional description
- Goal: obtain the names and trading information of all stocks on the Shanghai and Shenzhen exchanges
- Output: saved to a file
- Technical route: requests + bs4 + re
Program structure
- Get the stock list from Eastmoney (东方财富网)
- For each stock in the list, fetch its details from Baidu Gupiao (百度股票)
- Store the results in a file
Implementation
```python
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code='utf-8'):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, "html.parser")
            stockInfo = soup.find_all('div', attrs={'class': 'stock-bets'})[0]
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'stockname': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                # if i == 20:
                #     break
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print('\rcurrent progress: {:.2f}%'.format(count * 100 / len(lst)), end='')
        except:
            count = count + 1
            print('\rcurrent progress: {:.2f}%'.format(count * 100 / len(lst)), end='')
            # traceback.print_exc()
            continue

def main():
    stock_list_url = "http://quote.eastmoney.com/stocklist.html"
    stock_info_url = "https://gupiao.baidu.com/stock/"
    output_file = "/Users/entercoder/Documents/stock.txt"
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
```