A compact language to describe text patterns for searching, validation, or extraction.
Definition
A regex (short for regular expression) is a sequence of symbols that defines a pattern of text. It is used to find, validate, replace, or extract fragments of strings in many programming languages (e.g., Python, JavaScript, R, Java, PHP).
Why it’s useful
- Validation: emails, phone numbers, postal codes.
- Data cleaning: normalize formats, remove noise.
- Extraction: capture relevant groups (dates, prices, tags).
- Mass replacement: code refactors, log rewriting.
Basic syntax
.
any character (except newline by default)\d
digit;\w
word character;\s
whitespace- Quantifiers:
*
0+,+
1+,?
0 or 1,{m,n}
range - Anchors:
^
start,$
end - Classes:
[abc]
any of these;[^abc]
anything except these - Groups:
(...)
capturing;(?:...)
non-capturing;(?<name>...)
named group - Alternation:
a|b
- Flags: e.g.
i
(case-insensitive),m
(multiline),g
(global, JS),s
(dotall)
Examples
Python
import re
text = "Contact: hello@example.com and sales@shop.co"
pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
# Find all matches
emails = re.findall(pattern, text)
print(emails) # ['hello@example.com', 'sales@shop.co']
# Capture named groups (user and domain)
m = re.search(r"(?<user>[A-Za-z0-9._%+-]+)@(?<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})", text)
if m:
print(m.group("user"), m.group("domain"))
# Substitution
anon = re.sub(pattern, "[email]", text)
print(anon)
JavaScript
const text = "SKU: AB-12345 and AB-98765";
const re = /\bAB-(\d{5})\b/g;
const ids = [...text.matchAll(re)].map(m => m[1]);
console.log(ids); // ["12345","98765"]
// Basic date validation (YYYY-MM)
const isYearMonth = /^\d{4}-(0[1-9]|1[0-2])$/.test("2025-10");
console.log(isYearMonth); // true
Common mistakes and best practices
- Greediness:
.*
is greedy; use lazy quantifiers.*?
when needed. - Escaping: remember to escape meta-characters (
.
,+
,?
,(
,)
,[
,]
,{
,}
,^
,$
,|
,\
). - Readability: for long patterns, use verbose/commented mode (
re.VERBOSE
in Python) or split into parts. - Performance: avoid patterns that cause catastrophic backtracking; prefer explicit classes and anchors.
- Internationalization: for accented or Unicode characters, enable Unicode-aware modes or libraries.