Python3 Regular Expressions Explained – 来学习啦 – 编程乐园, a practice-driven platform for learning programming

Python3 Regular Expressions Explained

天天向上

Published: 2025-03-15 19:37:10

Original

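Python's re module provides powerful regular expression support for matching, searching, and replacing text in strings. This article works through regular expressions in Python 3 in depth:

1. Importing the `re` module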

import re

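2. Regular expression basics

A regular expression describes a pattern of characters to match against a string. The most common building blocks are:

\d matches a digit
\w matches a word character (letter, digit, or underscore)
. matches any character except a newline
* matches the preceding element 0 or more times
+ matches the preceding element 1 or more times
? matches the preceding element 0 or 1 times
{n,m} matches the preceding element n to m times
^ matches the start of the string
$ matches the end of the string (or just before a trailing newline)

Example:

pattern = r'\d+'  # one or more digits
text = "今天是 2025 年 3 月 15 日"
match = re.findall(pattern, text)
print(match)  # Output: ['2025', '3', '15']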
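As a minimal sketch of how the quantifiers and anchors listed above combine, the snippet below checks a few strings against small patterns. The sample strings and variable names are illustrative assumptions, not part of the original examples.

```python
import re

# {n,m}: between 2 and 4 digits, anchored at both ends with ^ and $
two_to_four_digits = re.compile(r'^\d{2,4}$')
print(bool(two_to_four_digits.search("2025")))   # True
print(bool(two_to_four_digits.search("20255")))  # False: five digits

# ?: the "u" is optional, so both spellings match
colour = re.compile(r'^colou?r$')
print(bool(colour.search("color")), bool(colour.search("colour")))  # True True

# \w+ picks out runs of word characters (letters, digits, underscore)
words = re.findall(r'\w+', "snake_case and CamelCase42")
print(words)  # ['snake_case', 'and', 'CamelCase42']
```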

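3. Main functions of the `re` module

3.1 `re.match()`

Tries to match the pattern at the start of the string. Returns a match object on success and None otherwise.

pattern = r'hello'
text = "hello world"
match = re.match(pattern, text)
if match:
    print("Matched:", match.group())  # Output: Matched: hello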

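3.2 `re.search()`

Scans the whole string and returns a match object for the first match, or None if nothing matches.

pattern = r'\d+'
text = "今天是 2025 年"
match = re.search(pattern, text)
if match:
    print("Found number:", match.group())  # Output: Found number: 2025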
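The difference between matching at the start and searching anywhere is easiest to see side by side. The sketch below also uses `re.fullmatch()`, which this article does not otherwise cover; it requires the entire string to match. The sample text is an assumption made for illustration.

```python
import re

text = "order 66 shipped"

# re.match only tries at position 0, so it fails: the text starts with "order"
at_start = re.match(r'\d+', text)
print(at_start)  # None

# re.search scans forward and finds the first run of digits
anywhere = re.search(r'\d+', text)
print(anywhere.group())  # 66

# re.fullmatch requires the whole string to match the pattern
whole_ok = re.fullmatch(r'\d+', "66")
whole_bad = re.fullmatch(r'\d+', "66 units")
print(whole_ok is not None, whole_bad is not None)  # True False
```

For validating input (rather than extracting from it), `re.fullmatch()` is usually the safer choice because trailing junk cannot slip through.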

3.3 `re.findall()`

Returns a list of all non-overlapping matches as strings.

pattern = r'\d+'
text = "2025 年 3 月 15 日"
matches = re.findall(pattern, text)
print(matches)  # Output: ['2025', '3', '15']

3.4 `re.finditer()`

Returns an iterator that yields one match object per match.

pattern = r'\d+'
text = "2025 年 3 月 15 日"
matches = re.finditer(pattern, text)
for match in matches:
    print("Found number:", match.group())  # prints 2025, 3, 15 in turn
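One reason to prefer `re.finditer()` over `re.findall()` is that each match object also carries position information. A small sketch, reusing the sample text above (the variable name `positions` is an assumption):

```python
import re

text = "2025 年 3 月 15 日"

# Pair every matched number with its (start, end) character offsets
positions = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]
print(positions)  # [('2025', (0, 4)), ('3', (7, 8)), ('15', (11, 13))]
```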

3.5 `re.sub()`

Replaces every match in the string with the given replacement.

pattern = r'\d+'
text = "2025 年 3 月 15 日"
new_text = re.sub(pattern, "XX", text)
print(new_text)  # Output: XX 年 XX 月 XX 日

3.6 `re.split()`

Splits the string wherever the pattern matches and returns the pieces as a list.

pattern = r'\s+'
text = "hello   world  python"
result = re.split(pattern, text)
print(result)  # Output: ['hello', 'world', 'python']
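Two variations on `re.split()` come up often: limiting the number of splits with `maxsplit`, and keeping the delimiters by wrapping the pattern in a capturing group. The following is a small sketch with made-up input strings.

```python
import re

text = "a, b;c  d"

# Split on any run of commas, semicolons, or whitespace
parts = re.split(r'[,;\s]+', text)
print(parts)  # ['a', 'b', 'c', 'd']

# maxsplit=1 stops after the first split
head_tail = re.split(r'[,;\s]+', text, maxsplit=1)
print(head_tail)  # ['a', 'b;c  d']

# A capturing group keeps the delimiters in the result
with_delims = re.split(r'([,;])', "a,b;c")
print(with_delims)  # ['a', ',', 'b', ';', 'c']
```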

4. Going further with regular expressions

4.1 Groups (what `()` does)

Parentheses () create capturing groups, and .group(n) returns the text captured by group n.

pattern = r'(\d{4})-(\d{2})-(\d{2})'
text = "日期是 2025-03-15"
match = re.search(pattern, text)
if match:
    print("Year:", match.group(1))   # 2025
    print("Month:", match.group(2))  # 03
    print("Day:", match.group(3))    # 15
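Numbered groups become hard to follow as patterns grow. As a sketch of an alternative, the same date pattern can use named groups with the `(?P<name>...)` syntax; the group names `year`, `month`, and `day` are chosen here for illustration.

```python
import re

pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
match = pattern.search("日期是 2025-03-15")

if match:
    print(match.group('year'))  # 2025
    print(match.groupdict())    # {'year': '2025', 'month': '03', 'day': '15'}
    print(match.groups())       # ('2025', '03', '15')
```

Named groups are still numbered, so `match.group(1)` keeps working alongside `match.group('year')`.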

4.2 Non-greedy matching

By default, * and + are greedy: they match as much as possible.
Appending ? makes them non-greedy (lazy), so they match as little as possible.

text = '<div>hello</div><div>world</div>'
greedy_match = re.search(r'<div>.*</div>', text)
lazy_match = re.search(r'<div>.*?</div>', text)

print("Greedy:", greedy_match.group())  # Output: Greedy: <div>hello</div><div>world</div>
print("Lazy:", lazy_match.group())      # Output: Lazy: <div>hello</div>

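4.3 `re.compile()`

Compiles a pattern into a reusable pattern object. The module-level functions already cache recently compiled patterns, so the speed gain is usually small; the main benefits are skipping that cache lookup in hot loops and giving the pattern a name.

pattern = re.compile(r'\d+')
text = "今天是 2025 年 3 月 15 日"
matches = pattern.findall(text)
print(matches)  # ['2025', '3', '15']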

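4.4 `re.VERBOSE` mode

Lets you spread a pattern over several lines and add comments, which improves readability. Whitespace inside the pattern is ignored in this mode.

pattern = re.compile(r"""
    (\d{4})  # year
    -        # separator
    (\d{2})  # month
    -        # separator
    (\d{2})  # day
""", re.VERBOSE)

text = "今天的日期是 2025-03-15"
match = pattern.search(text)
if match:
    print(match.groups())  # ('2025', '03', '15')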

5. Common applications

5.1 Matching a mobile phone number

pattern = r'1[3-9]\d{9}'
text = "我的手机号码是 13812345678"
match = re.search(pattern, text)
if match:
    print("Found phone number:", match.group())  # 13812345678
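The pattern above will also fire inside a longer run of digits, such as an order number. One way to tighten it, sketched below, is to wrap it in negative lookarounds (covered in section 7) so that no digit may touch either end of the match. The names `loose` and `strict` and the order-number string are assumptions for illustration.

```python
import re

loose = re.compile(r'1[3-9]\d{9}')
# (?<!\d) and (?!\d): no digit directly before or after the 11-digit number
strict = re.compile(r'(?<!\d)1[3-9]\d{9}(?!\d)')

order_id = "订单号 2013812345678901"
false_positive = loose.search(order_id)
print(false_positive.group())   # 13812345678, taken from inside the order number
print(strict.search(order_id))  # None

real_number = strict.search("我的手机号码是 13812345678")
print(real_number.group())      # 13812345678
```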

5.2 Matching an email address

pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
text = "联系我：example@gmail.com"
match = re.search(pattern, text)
if match:
    print("Found email:", match.group())  # example@gmail.com

5.3 Extracting a URL

pattern = r'https?://[a-zA-Z0-9./-]+'
text = "访问我的博客：https://www.example.com"
match = re.search(pattern, text)
if match:
    print("Found URL:", match.group())  # https://www.example.com

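Python3 Regular Expressions in Depth (Advanced Part)

With the basics covered, this part moves on to more advanced uses of regular expressions in Python 3, including matching flags, zero-width assertions (lookahead and lookbehind), and their negative forms.

6. Advanced use of the `re` module

6.1 `re.MULTILINE` (multi-line mode)

By default, ^ matches only at the start of the string and $ only at the end (or just before a trailing newline). With re.MULTILINE, ^ and $ also match at the start and end of every line.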

import re

text = """Hello World
Python is great
Regular Expressions"""

pattern = r"^Python"
match = re.search(pattern, text, re.MULTILINE)

if match:
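    print("Matched:", match.group())  # Output: Matched: Python

🔹 Without re.MULTILINE, ^ matches only at the start of the whole string, so the Python at the beginning of the second line would not be found.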
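To make the effect of the flag concrete, the sketch below runs the same patterns with and without `re.MULTILINE` on the three-line text used above. The variable names are assumptions for illustration.

```python
import re

text = """Hello World
Python is great
Regular Expressions"""

# Without the flag, ^ anchors only at the very start of the string
first_only = re.findall(r'^\w+', text)
print(first_only)   # ['Hello']

# With re.MULTILINE, ^ also anchors right after each newline
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
print(line_starts)  # ['Hello', 'Python', 'Regular']

# $ behaves the same way: it now matches at the end of every line
line_ends = re.findall(r'\w+$', text, re.MULTILINE)
print(line_ends)    # ['World', 'great', 'Expressions']
```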

6.2 `re.IGNORECASE` (case-insensitive matching)

Ignores letter case, which makes the pattern more forgiving.

pattern = r"hello"
text = "Hello World"

match = re.search(pattern, text, re.IGNORECASE)
if match:
    print("Matched:", match.group())  # Output: Matched: Hello

6.3 `re.DOTALL` (let `.` match newlines)

By default, . does not match the newline character \n. With re.DOTALL it does.

pattern = r"Hello.*World"
text = "Hello\nWorld"

match = re.search(pattern, text, re.DOTALL)
if match:
    print("Matched:", match.group())  # Output: Hello and World, printed on two lines
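The flags in this section can be combined. A brief sketch, assuming a sample string that needs both case-insensitive matching and a dot that crosses a newline; it shows the bitwise OR form and the equivalent inline form `(?is)` placed at the start of the pattern.

```python
import re

text = "HELLO\nworld"

# Flags combine with the bitwise OR operator
combined = re.search(r'hello.*world', text, re.IGNORECASE | re.DOTALL)
print(combined is not None)  # True

# The same two flags written inline at the start of the pattern
inline = re.search(r'(?is)hello.*world', text)
print(inline.group() == text)  # True

# With only one of the two flags the dot cannot cross the newline
only_ignorecase = re.search(r'hello.*world', text, re.IGNORECASE)
print(only_ignorecase)  # None
```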

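7. Assertions

7.1 Lookahead

A lookahead (?=...) asserts that the given pattern follows the current position, without consuming any characters.

pattern = r'\d+(?=元)'
text = "商品价格是 100元 和 200元"

matches = re.findall(pattern, text)
print(matches)  # Output: ['100', '200']

🔹 Explanation: \d+ matches one or more digits, and (?=元) requires those digits to be followed by "元" without including "元" in the match.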

7.2 Negative lookahead

A negative lookahead (?!...) asserts that the given pattern does not follow the current position.

pattern = r'hello(?! world)'
text1 = "hello world"
text2 = "hello python"

match1 = re.search(pattern, text1)
match2 = re.search(pattern, text2)

print("Result 1:", bool(match1))  # False
print("Result 2:", bool(match2))  # True

🔹 Explanation: hello(?! world) matches hello only when it is not followed by " world".
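A common practical use of negative lookahead is filtering: accept everything except strings of a certain shape. The sketch below keeps file names that do not end in `.tmp`; the file list and the names `keep` and `kept` are made up for illustration.

```python
import re

files = ["report.txt", "cache.tmp", "notes.md", "session.tmp"]

# ^(?!.*\.tmp$) rejects any name ending in ".tmp"; .+$ then consumes the name
keep = re.compile(r'^(?!.*\.tmp$).+$')

kept = [name for name in files if keep.search(name)]
print(kept)  # ['report.txt', 'notes.md']
```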

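7.3 Lookbehind

A lookbehind (?<=...) asserts that the given pattern immediately precedes the current position, without consuming any characters.

pattern = r'(?<=\$)\d+'
text = "价格是 $100 和 $200"

matches = re.findall(pattern, text)
print(matches)  # Output: ['100', '200']

🔹 Explanation: (?<=\$) matches digits that come right after $, without including the $ in the match.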
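One restriction is worth knowing: in the standard `re` module a lookbehind must have a fixed width, so quantifiers such as `*` or `+` are not allowed inside it. The sketch below demonstrates this with a made-up `USD` prefix; the `rejected` flag exists only to record what happened.

```python
import re

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r'(?<=USD\s*)\d+')  # \s* can match any number of characters
    rejected = False
except re.error as exc:
    rejected = True
    print("rejected:", exc)

# A fixed-width lookbehind expressing a similar idea compiles and works
amounts = re.findall(r'(?<=USD )\d+', "USD 100 and USD 250")
print(amounts)  # ['100', '250']
```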

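7.4 Negative lookbehind

A negative lookbehind (?<!...) asserts that the given pattern does not immediately precede the current position.

pattern = r'(?<![\$\d])\d+'
text = "100 是普通数字，$200 是带符号的金额"

matches = re.findall(pattern, text)
print(matches)  # Output: ['100']

🔹 Explanation: (?<![\$\d])\d+ matches a run of digits that is preceded by neither $ nor another digit. The \d in the lookbehind matters: with (?<!\$)\d+ alone, the engine skips the 2 in $200 but still matches the trailing 00, because those digits follow 2 rather than $, and the result would be ['100', '00'].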

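8. Greedy vs. non-greedy matching

8.1 Greedy matching

By default the quantifiers *, +, ?, and {n,m} are greedy and match as many characters as possible.

text = "<div>hello</div><div>world</div>"
pattern = r"<div>.*</div>"

match = re.search(pattern, text)
print(match.group())  # Output: <div>hello</div><div>world</div>

🔹 Explanation: .* tries to match as much as possible, so it runs to the last </div> and the match covers the whole of <div>hello</div><div>world</div>.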

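8.2 Non-greedy matching

The forms *?, +?, and {n,m}? make the quantifier non-greedy, so it matches as few characters as possible.

pattern = r"<div>.*?</div>"
match = re.search(pattern, text)
print(match.group())  # Output: <div>hello</div>

🔹 Explanation: .*? stops at the first </div>, so the result is <div>hello</div>.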
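A lazy quantifier is not the only way to stop early. The sketch below compares it with a negated character class, which simply cannot run past the next `<`, and also shows the lazy form of `{n,m}`. Variable names are assumptions for illustration.

```python
import re

text = "<div>hello</div><div>world</div>"

# Lazy quantifier: stop at the first closing tag each time
lazy = re.findall(r'<div>(.*?)</div>', text)
print(lazy)     # ['hello', 'world']

# Negated character class: [^<]* can never consume a "<"
negated = re.findall(r'<div>([^<]*)</div>', text)
print(negated)  # ['hello', 'world']

# {n,m} is greedy, {n,m}? takes the minimum that still lets the match succeed
greedy_digits = re.search(r'\d{2,4}', "12345").group()
lazy_digits = re.search(r'\d{2,4}?', "12345").group()
print(greedy_digits, lazy_digits)  # 1234 12
```

The negated class is often preferred when the delimiter is a single character, since it states the intent directly and avoids backtracking over the content.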

9. More involved examples

9.1 Extracting the content of an HTML tag

pattern = r'<title>(.*?)</title>'
text = "<html><head><title>Python 正则表达式</title></head></html>"

match = re.search(pattern, text)
if match:
    print(match.group(1))  # Output: Python 正则表达式

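9.2 Parsing a URL

pattern = r"https?://(www\.)?([\w.-]+)"
text = "访问 https://www.example.com 或 http://test.com"

matches = re.findall(pattern, text)
print(matches)
# Output: [('www.', 'example.com'), ('', 'test.com')]

🔹 Explanation: The pattern has two capturing groups, so findall returns one tuple per URL: the optional www. prefix and the rest of the host name.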

9.3 Replacing sensitive words

pattern = r'坏话|脏话'
text = "这个人说了一些坏话和脏话"
new_text = re.sub(pattern, "**", text)
print(new_text)  # Output: 这个人说了一些**和**
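In practice the word list usually comes from data rather than being typed into the pattern. The following sketch builds the alternation dynamically and masks each hit with as many asterisks as it has characters. The list `banned` is hypothetical; it includes `a.b` to show why `re.escape()` is needed.

```python
import re

# Hypothetical word list; "a.b" contains the metacharacter "."
banned = ["坏话", "脏话", "a.b"]

# re.escape neutralises metacharacters so each word is matched literally
pattern = re.compile("|".join(re.escape(word) for word in banned))

def mask(match):
    # Replace each hit with one asterisk per character
    return "*" * len(match.group())

masked = pattern.sub(mask, "这个人说了一些坏话和脏话，还有 a.b 但不是 axb")
print(masked)  # 这个人说了一些**和**，还有 *** 但不是 axb
```

Without `re.escape()`, the entry `a.b` would also mask `axb`, because an unescaped dot matches any character.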

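Python 3 regular expressions offer further advanced features: backreferences, nested (recursive) matching, compiled patterns, performance tuning, and the differences from the third-party regex module. The following sections cover each in turn.

11. Backreferences

A backreference lets a pattern refer to the text captured by an earlier group. It is commonly used to detect repeated words or to pair up HTML tags.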

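11.1 Detecting repeated words

import re

text = "I love love Python!"
pattern = r"\b(\w+)\s+\1\b"

match = re.search(pattern, text)
if match:
    print("Found repeated word:", match.group())  # Output: Found repeated word: love love

🔹 Explanation:

(\w+) captures a word
\1 refers back to whatever capture group 1 matched
\s+ requires whitespace between the two words
\b marks a word boundary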
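Detection is often followed by repair. As a sketch under the same idea, `re.sub()` with a backreference in both the pattern and the replacement can collapse repeated words; the sample sentence and the name `deduped` are assumptions.

```python
import re

text = "I love love Python and and and regex!"

# (?:\s+\1\b)+ absorbs every further repetition of the captured word;
# the replacement r'\1' keeps a single copy
deduped = re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text)
print(deduped)  # I love Python and regex!
```

The trailing `\b` prevents false hits such as "the then", where the second word merely starts with the first.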

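11.2 Matching complete HTML tags

A backreference can ensure that the opening and closing tags are the same:

text = "<b>Python</b> and <i>Regex</i>"
pattern = r"<(\w+)>(.*?)</\1>"

matches = re.findall(pattern, text)
print(matches)
# Output: [('b', 'Python'), ('i', 'Regex')]

🔹 Explanation:

<(\w+)> captures the tag name
(.*?) captures the content inside the tag (non-greedy)
</\1> requires the closing tag to have the same name as the opening tag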

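12. Nested (recursive) matching

Python's re module has no recursive patterns, so it cannot match arbitrarily nested structures such as the parentheses in ((a+b)*c). The third-party regex module (installed with pip install regex) can:

import regex

text = "(a (b c) d (e (f g)))"
pattern = r"\((?:[^()]++|(?R))*+\)"

matches = regex.findall(pattern, text)
print(matches)  # Output: ['(a (b c) d (e (f g)))']

🔹 Explanation:

(?R) recursively applies the whole pattern
(?:[^()]++|(?R))*+ matches either a run of non-parenthesis characters or, recursively, a nested parenthesised group
findall returns non-overlapping matches, so only the outermost balanced group is reported; the inner groups are consumed as part of it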
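When adding a third-party dependency is not an option, balanced parentheses can be handled without any regular expression at all. The sketch below is a plain stack-based scan, not a regex technique; the function name `balanced_groups` is hypothetical.

```python
def balanced_groups(text):
    """Return every balanced (...) group, in the order the groups close."""
    stack = []   # indices of "(" characters that are still open
    groups = []
    for index, char in enumerate(text):
        if char == "(":
            stack.append(index)
        elif char == ")" and stack:
            start = stack.pop()
            groups.append(text[start:index + 1])
    return groups

found = balanced_groups("(a (b c) d (e (f g)))")
print(found)
# ['(b c)', '(f g)', '(e (f g))', '(a (b c) d (e (f g)))']
```

Unlike a single `findall` call, this returns the inner groups as well as the outermost one, and unmatched closing parentheses are simply skipped.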

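13. Compiling regular expressions (`re.compile()`)

When a pattern is used many times, it can be compiled once with re.compile() and reused:

pattern = re.compile(r"\d{3}-\d{4}-\d{4}")

# Use the compiled pattern
print(pattern.findall("电话: 123-4567-8910, 987-6543-2100"))
# Output: ['123-4567-8910', '987-6543-2100']

🔹 Advantages:

Slightly lower overhead on repeated use (the module-level functions also cache compiled patterns, so the gain is mainly the skipped cache lookup)
Clearer code (a complex pattern can be stored under a descriptive name)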

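14. Advanced replacement with `re.sub()`

14.1 Using a callback function with `re.sub()`

When the replacement rule is more complex, a callback function can compute the replacement for each match:

def mask_email(match):
    return match.group(1) + "@****.com"

pattern = r"(\w+)@\w+\.\w+"
text = "邮箱: alice@example.com, bob@gmail.com"
masked_text = re.sub(pattern, mask_email, text)

print(masked_text)
# Output: 邮箱: alice@****.com, bob@****.com

🔹 Explanation:

re.sub() accepts a function as the replacement; it is called with each match object and must return a string
match.group(1) is the part before @, and the literal @****.com hides the domain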
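Two related tools are worth a brief sketch: group references inside a replacement string, and `re.subn()`, which returns the number of replacements alongside the result. The sample strings and the helper `double` are assumptions for illustration.

```python
import re

# Group references in the replacement string reorder the captured parts
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', "2025-03-15")
print(swapped)  # 15/03/2025

def double(match):
    # The callback receives a match object and returns the replacement text
    return str(int(match.group()) * 2)

# re.subn works like re.sub but also reports how many replacements were made
result, count = re.subn(r'\d+', double, "3 apples and 12 pears")
print(result, count)  # 6 apples and 24 pears 2
```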

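15. The `regex` module vs. the `re` module

Python's built-in re module is powerful, but the third-party regex module offers additional features, for example:

Feature	`re` module	`regex` module
Unicode support	✅	✅
Recursive patterns	❌	✅
Named groups	✅	✅
`\p{...}` Unicode properties	❌	✅

For example, the regex module supports Unicode property classes:

import regex

text = "𝓟𝓎𝓉𝒽𝑜𝓃 𝒾𝓈 𝒜𝓌𝑒𝓈𝑜𝓂𝑒!"
pattern = r"\p{Letter}+"  # one or more Unicode letters

matches = regex.findall(pattern, text)
print(matches)  # ['𝓟𝓎𝓉𝒽𝑜𝓃', '𝒾𝓈', '𝒜𝓌𝑒𝓈𝑜𝓂𝑒']

🔹 Explanation:

\p{Letter} matches any character in the Unicode Letter category, not just ASCII letters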

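16. Performance tuning

Regular expressions can become a performance bottleneck. The following practices help:

✅ 1. Avoid excessive backtracking

# Slow on non-matching input: the nested quantifiers cause heavy backtracking
pattern = r"(a+)+b"

# Matches the same text and is fast: a single quantifier, no nesting
pattern = r"a+b"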
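Before swapping a slow pattern for a simpler one, it is worth checking that both find the same text. The sketch below does this on a handful of short, made-up inputs (kept short on purpose, since the nested form slows down sharply on long runs of `a` with no `b`).

```python
import re

nested = re.compile(r'(a+)+b')  # nested quantifiers
flat = re.compile(r'a+b')       # single quantifier

def found(pattern, s):
    # Return the matched text, or None when there is no match
    m = pattern.search(s)
    return m.group() if m else None

samples = ["aaab", "xab", "aaa", "b", "caab!"]
results = {s: (found(nested, s), found(flat, s)) for s in samples}

for s, (slow, fast) in results.items():
    print(repr(s), slow, fast, slow == fast)
```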

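✅ 2. Precompile with re.compile()

pattern = re.compile(r"\d{10}")
for _ in range(1000):
    pattern.search("1234567890")  # reuses the compiled pattern on every iteration

✅ 3. Choose the right function

re.match() matches only at the start, which suits strings with a known format
re.search() finds the first match anywhere in the string
re.findall() finds all matches, which suits bulk extraction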

17. Exercises

17.1 Extracting dates

Input text

今天是 2025-03-15, 明天是 2025/03/16, 还有 03-17-2025.

Goal
Extract every date in the following formats:

YYYY-MM-DD
YYYY/MM/DD
MM-DD-YYYY

pattern = r"\b\d{4}[-/]\d{2}[-/]\d{2}\b|\b\d{2}-\d{2}-\d{4}\b"
text = "今天是 2025-03-15, 明天是 2025/03/16, 还有 03-17-2025."

matches = re.findall(pattern, text)
print(matches)  # Output: ['2025-03-15', '2025/03/16', '03-17-2025']
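A natural follow-up to this exercise is to normalise the extracted dates to one format. The sketch below is one possible approach, combining named groups, `re.VERBOSE`, and `re.finditer()`. The group names (`y1`, `m1`, and so on) and the helper `to_iso` are hypothetical; two sets of names are needed because Python does not allow the same group name twice in one pattern.

```python
import re

pattern = re.compile(r"""
    \b(?P<y1>\d{4})[-/](?P<m1>\d{2})[-/](?P<d1>\d{2})\b   # YYYY-MM-DD or YYYY/MM/DD
  | \b(?P<m2>\d{2})-(?P<d2>\d{2})-(?P<y2>\d{4})\b         # MM-DD-YYYY
""", re.VERBOSE)

def to_iso(match):
    # Only one alternative matched; the other branch's groups are None
    g = match.groupdict()
    year = g["y1"] or g["y2"]
    month = g["m1"] or g["m2"]
    day = g["d1"] or g["d2"]
    return f"{year}-{month}-{day}"

text = "今天是 2025-03-15, 明天是 2025/03/16, 还有 03-17-2025."
iso_dates = [to_iso(m) for m in pattern.finditer(text)]
print(iso_dates)  # ['2025-03-15', '2025-03-16', '2025-03-17']
```

This only rearranges the digits; it does not check that the result is a real calendar date.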

For more details, see the other related articles.
