A Detailed Guide to Python urllib

天天向上

Published: 2025-03-18 00:01:52

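1. Introduction to `urllib`

urllib is a package in the Python standard library for working with URLs. It covers fetching web pages, parsing URLs, and encoding and decoding URL components. urllib contains several submodules:

urllib.request: opens and reads URLs.
urllib.response: defines the response wrapper classes used by urllib.request (not usually used directly).
urllib.parse: parses and builds URLs.
urllib.error: defines the exceptions raised by urllib.request.
urllib.robotparser: parses robots.txt rules to decide whether a URL may be crawled.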

2. `urllib.request` (Sending HTTP Requests)

urllib.request is mainly used to open and read URLs, including HTTP and HTTPS.

2.1 Sending a GET Request

import urllib.request

url = "https://www.example.com"
response = urllib.request.urlopen(url)
html = response.read().decode("utf-8")
print(html)

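Notes

urlopen(url) sends the request and, for HTTP and HTTPS URLs, returns an http.client.HTTPResponse object.
read() returns the response body as bytes.
decode("utf-8") converts those bytes to a str. This assumes the page is encoded as UTF-8; decoding with the wrong charset produces garbled text.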
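Hard-coding "utf-8" only works when the server really sends UTF-8. The following is a minimal sketch of reading the charset from the Content-Type header instead. The helper name read_text is hypothetical, and a data: URL stands in for a real server so that the example runs offline.

```python
import urllib.request


def read_text(response, default="utf-8"):
    """Decode a response body using the charset from its Content-Type header.

    read_text is a hypothetical helper, not part of urllib.
    """
    # response.headers is an email.message.Message; get_content_charset()
    # returns the declared charset in lowercase, or None if none is declared.
    charset = response.headers.get_content_charset() or default
    return response.read().decode(charset, errors="replace")


# A data: URL (handled by urlopen itself) stands in for a real page.
demo_url = "data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD%20urllib"

with urllib.request.urlopen(demo_url) as response:
    text = read_text(response)

print(text)
```

With a real site, the same helper would be called on the object returned by urlopen(url). The fallback to UTF-8 is an assumption that suits most modern pages but not all of them.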

2.2 Sending a GET Request with Query Parameters

import urllib.request
import urllib.parse

base_url = "https://www.example.com/search"
params = {"q": "python urllib", "page": 1}
query_string = urllib.parse.urlencode(params)
url = f"{base_url}?{query_string}"

response = urllib.request.urlopen(url)
html = response.read().decode("utf-8")
print(html)

Notes

urllib.parse.urlencode(params) converts the dictionary into a query string such as q=python+urllib&page=1.
f"{base_url}?{query_string}" joins the base URL and the query string with "?".
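The steps above can be wrapped in a small function. This is only a sketch: build_url is a hypothetical helper, and it additionally assumes that list values should become repeated keys (doseq=True) and that the base URL may already contain a query string.

```python
from urllib.parse import urlencode


def build_url(base_url, params):
    """Append params to base_url as a query string (hypothetical helper)."""
    # doseq=True expands list values into repeated keys (tag=a&tag=b)
    # instead of encoding the list's string representation.
    query = urlencode(params, doseq=True)
    # Use "&" if base_url already carries a query string, "?" otherwise.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"


search_url = build_url("https://www.example.com/search",
                       {"q": "python urllib", "page": 1})
tagged_url = build_url("https://www.example.com/search?sort=new",
                       {"tag": ["http", "url"]})
print(search_url)  # https://www.example.com/search?q=python+urllib&page=1
print(tagged_url)  # https://www.example.com/search?sort=new&tag=http&tag=url
```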

2.3 Sending a POST Request

import urllib.request
import urllib.parse

url = "https://www.example.com/login"
data = {
    "username": "admin",
    "password": "123456"
}

data_encoded = urllib.parse.urlencode(data).encode("utf-8")  # URL-encode the form, then convert it to bytes
req = urllib.request.Request(url, data=data_encoded, method="POST")
response = urllib.request.urlopen(req)

print(response.read().decode("utf-8"))

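Notes

data_encoded = urllib.parse.urlencode(data).encode("utf-8"): URL-encodes the POST fields and converts the result to bytes, because the data argument must be bytes rather than str.
urllib.request.Request(url, data=data_encoded, method="POST"): creates the POST request object.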
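A Request object can be inspected before it is sent, which is a convenient way to check what the code above produces. The sketch below uses a hypothetical helper, build_form_post, and never calls urlopen. It also shows that a Request carrying data reports POST even when the method argument is omitted.

```python
import urllib.parse
import urllib.request


def build_form_post(url, fields):
    """Return an unsent Request carrying fields as a form body (hypothetical helper)."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    # No method argument: a Request with data reports POST on its own.
    return urllib.request.Request(url, data=body)


req = build_form_post("https://www.example.com/login",
                      {"username": "admin", "password": "123456"})
plain = urllib.request.Request("https://www.example.com/login")

print(req.get_method())    # POST
print(req.data)            # b'username=admin&password=123456'
print(plain.get_method())  # GET, because no data was supplied
```

Passing method="POST" explicitly, as the original example does, is still a reasonable choice because it states the intent at the call site.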

2.4 Adding Headers (Imitating a Browser Request)

import urllib.request

url = "https://www.example.com"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
}

req = urllib.request.Request(url, headers=headers)
response = urllib.request.urlopen(req)

print(response.read().decode("utf-8"))

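Notes

By default urllib identifies itself with a User-Agent of the form Python-urllib/3.x, which some servers reject. Sending a browser-like User-Agent header makes the request look like it comes from a browser and avoids that rejection.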
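Headers can also be added after the Request is created and read back before sending. The sketch below does not contact any server; the Accept-Language value is an arbitrary illustration.

```python
import urllib.request

req = urllib.request.Request(
    "https://www.example.com",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)
# Headers can also be added after construction.
req.add_header("Accept-Language", "en-US,en;q=0.8")

# Request stores header names via str.capitalize(), so they must be
# looked up in that form ("User-agent", not "User-Agent").
print(req.get_header("User-agent"))
print(req.has_header("Accept-language"))
print(sorted(req.header_items()))
```

The capitalized spelling only affects lookups on the Request object; HTTP header names themselves are case-insensitive.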

2.5 Handling HTTP Errors (`urllib.error`)

import urllib.request
import urllib.error

url = "https://www.example.com/notfound"

try:
    response = urllib.request.urlopen(url)
    print(response.read().decode("utf-8"))
except urllib.error.HTTPError as e:
    print("HTTP Error:", e.code, e.reason)
except urllib.error.URLError as e:
    print("URL Error:", e.reason)

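Notes

HTTPError: raised for HTTP 4xx and 5xx responses; it exposes the status code as e.code and the status text as e.reason.
URLError: raised when the server cannot be reached at all, for example because of a DNS failure, a refused connection, or an unsupported URL scheme. HTTPError is a subclass of URLError, so its except clause must come first.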
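The ordering of the except clauses matters because of the class hierarchy. The sketch below illustrates it without a network connection by constructing the exceptions by hand; describe_error is a hypothetical helper.

```python
import io
import urllib.error
from email.message import Message


def describe_error(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # Test HTTPError first: it is a subclass of URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    raise TypeError(f"not a urllib error: {exc!r}")


# Exceptions built by hand so the example runs without a network connection.
not_found = urllib.error.HTTPError(
    "https://www.example.com/notfound", 404, "Not Found", Message(), io.BytesIO()
)
unreachable = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(not_found))    # HTTP error 404: Not Found
print(describe_error(unreachable))  # URL error: connection refused
```

If the two isinstance checks were swapped, the 404 would be reported as a generic URL error and its status code would be lost.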

3. `urllib.parse` (Parsing and Building URLs)

3.1 Parsing a URL

import urllib.parse

url = "https://www.example.com/search?q=python&lang=en"

parsed_url = urllib.parse.urlparse(url)
print(parsed_url)

Output

ParseResult(scheme='https', netloc='www.example.com', path='/search', params='', query='q=python&lang=en', fragment='')

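Notes

scheme: the protocol (https).
netloc: the network location (www.example.com); it also includes the port and credentials when the URL has them.
path: the path (/search).
query: the query string (q=python&lang=en).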
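Beyond the four fields listed above, the result also offers derived attributes and can be turned back into a URL. The sketch below uses a made-up URL with a port and a fragment to show them.

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse("https://www.example.com:8443/search?q=python&lang=en#top")

# hostname and port are derived from netloc.
print(parts.netloc)    # www.example.com:8443
print(parts.hostname)  # www.example.com
print(parts.port)      # 8443
print(parts.fragment)  # top

# ParseResult is a named tuple: _replace() returns a modified copy and
# geturl() reassembles it into a string.
changed = parts._replace(query="q=urllib")
print(changed.geturl())  # https://www.example.com:8443/search?q=urllib#top

# urlunparse() builds a URL from the six components directly.
built = urlunparse(("https", "www.example.com", "/docs", "", "page=2", ""))
print(built)  # https://www.example.com/docs?page=2
```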

3.2 Parsing Query Parameters

import urllib.parse

url = "https://www.example.com/search?q=python&lang=en"
query_params = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)
print(query_params)

Output

{'q': ['python'], 'lang': ['en']}
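Every value in the output is a list because a key may appear more than once in a query string. The sketch below shows that behavior along with parse_qsl and the keep_blank_values option; first_values is a hypothetical helper for the common case where one value per key is enough.

```python
from urllib.parse import parse_qs, parse_qsl

query = "q=python&lang=en&lang=zh&debug="

# parse_qs groups repeated keys into one list per key.
print(parse_qs(query))
# Blank values are dropped unless keep_blank_values=True.
print(parse_qs(query, keep_blank_values=True))
# parse_qsl keeps the original order as (key, value) pairs.
print(parse_qsl(query))


def first_values(query_string):
    """Collapse parse_qs output to one value per key (hypothetical helper)."""
    return {key: values[0] for key, values in parse_qs(query_string).items()}


print(first_values("q=python&lang=en"))
```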

3.3 URL Encoding

import urllib.parse

params = {"q": "Python 编程", "page": 2}
encoded_params = urllib.parse.urlencode(params)
print(encoded_params)  # q=Python+%E7%BC%96%E7%A8%8B&page=2
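urlencode works on whole dictionaries. For a single string, urllib.parse also provides quote, quote_plus, and their inverses; the sketch below compares them on the same text used above.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "Python 编程"

path_part = quote(text)        # space -> %20, suited to path segments
query_part = quote_plus(text)  # space -> +, the form urlencode() produces
print(path_part)   # Python%20%E7%BC%96%E7%A8%8B
print(query_part)  # Python+%E7%BC%96%E7%A8%8B

# "/" is left alone by quote() unless safe="" is passed.
print(quote("/docs/a b"))           # /docs/a%20b
print(quote("/docs/a b", safe=""))  # %2Fdocs%2Fa%20b

# unquote() reverses quote(); unquote_plus() also turns "+" back into a space.
print(unquote(path_part))
print(unquote_plus(query_part))
```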

4. `urllib.robotparser` (Parsing `robots.txt`)

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()

print(rp.can_fetch("*", "https://www.example.com/page"))

Notes

rp.can_fetch("*", URL): returns True if the parsed robots.txt allows the given user agent ("*" means any) to fetch the URL.
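To see how the rules are applied without making a request, the parser can be fed robots.txt lines directly through parse(). The rules below are a made-up example, not the actual robots.txt of any site.

```python
import urllib.robotparser

# Hypothetical robots.txt content, parsed from memory so no request is made.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() accepts an iterable of lines

print(rp.can_fetch("*", "https://www.example.com/page"))            # True
print(rp.can_fetch("*", "https://www.example.com/private/report"))  # False
```

The parser returns the verdict of the first rule whose path prefix matches, which is why the Disallow line is listed before the catch-all Allow line here.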

5. `urllib.response` (Working with HTTP Responses)

import urllib.request

url = "https://www.example.com"
response = urllib.request.urlopen(url)

print(response.status)  # HTTP status code
print(response.getheaders())  # all response headers as (name, value) pairs
print(response.getheader("Content-Type"))  # value of a single header

6. Proxy Settings

import urllib.request

proxy = urllib.request.ProxyHandler({"http": "http://proxy.example.com:8080"})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)

url = "http://www.example.com"
response = urllib.request.urlopen(url)

print(response.read().decode("utf-8"))

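Notes

ProxyHandler routes requests through a proxy. Its mapping is keyed by URL scheme, so the example above proxies only http:// URLs. install_opener makes the opener the global default used by every later urlopen call.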
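When a global default is not wanted, the opener can be kept as a local object and used through opener.open(url). The sketch below builds such an opener for both schemes and inspects it without sending anything. make_proxy_opener is a hypothetical helper and proxy.example.com is a placeholder address, not a working proxy.

```python
import urllib.request


def make_proxy_opener(proxy_url):
    """Build an opener that sends http and https traffic through proxy_url.

    make_proxy_opener is a hypothetical helper, not part of urllib.
    """
    # ProxyHandler's mapping is keyed by URL scheme, so both are listed.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


opener = make_proxy_opener("http://proxy.example.com:8080")

# The opener is not installed globally; only opener.open(...) would use the proxy.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(len(proxy_handlers))
print(proxy_handlers[0].proxies)
```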

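7. Timeout Settings

import socket
import urllib.error
import urllib.request

url = "https://www.example.com"
try:
    response = urllib.request.urlopen(url, timeout=5)  # give up after 5 seconds
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    # URLError covers every connection failure, so check the reason for a timeout
    kind = "Request timed out:" if isinstance(e.reason, socket.timeout) else "Request failed:"
    print(kind, e.reason)
except socket.timeout as e:  # a timeout while reading the body is not wrapped in URLError
    print("Request timed out:", e)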

Summary

Task	Module	Key functions and classes
Sending HTTP requests	`urllib.request`	`urlopen()`, `Request()`
Handling HTTP errors	`urllib.error`	`HTTPError`, `URLError`
Parsing and building URLs	`urllib.parse`	`urlparse()`, `urlencode()`
Parsing `robots.txt`	`urllib.robotparser`	`RobotFileParser()`
Proxy and timeout settings	`urllib.request`	`ProxyHandler()`, `timeout`

urllib is well suited to simple HTTP requests. For richer features such as session management or built-in JSON decoding, consider the third-party requests library.