A compact language to describe text patterns for searching, validation, or extraction.
Definition
A regex (short for regular expression) is a sequence of symbols that defines a pattern of text. It is used to find, validate, replace, or extract fragments of strings in many programming languages (e.g., Python, JavaScript, R, Java, PHP).
Why it’s useful
- Validation: emails, phone numbers, postal codes.
- Data cleaning: normalize formats, remove noise.
- Extraction: capture relevant groups (dates, prices, tags).
- Mass replacement: code refactors, log rewriting.
Basic syntax
.any character (except newline by default)\ddigit;\wword character;\swhitespace- Quantifiers:
*0+,+1+,?0 or 1,{m,n}range - Anchors:
^start,$end - Classes:
[abc]any of these;[^abc]anything except these - Groups:
(...)capturing;(?:...)non-capturing;(?<name>...)named group - Alternation:
a|b - Flags: e.g.
i(case-insensitive),m(multiline),g(global, JS),s(dotall)
Examples
Python
import re
text = "Contact: hello@example.com and sales@shop.co"
pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
# Find all matches
emails = re.findall(pattern, text)
print(emails) # ['hello@example.com', 'sales@shop.co']
# Capture named groups (user and domain)
m = re.search(r"(?<user>[A-Za-z0-9._%+-]+)@(?<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})", text)
if m:
print(m.group("user"), m.group("domain"))
# Substitution
anon = re.sub(pattern, "[email]", text)
print(anon)JavaScript
const text = "SKU: AB-12345 and AB-98765";
const re = /\bAB-(\d{5})\b/g;
const ids = [...text.matchAll(re)].map(m => m[1]);
console.log(ids); // ["12345","98765"]
// Basic date validation (YYYY-MM)
const isYearMonth = /^\d{4}-(0[1-9]|1[0-2])$/.test("2025-10");
console.log(isYearMonth); // trueCommon mistakes and best practices
- Greediness:
.*is greedy; use lazy quantifiers.*?when needed. - Escaping: remember to escape meta-characters (
.,+,?,(,),[,],{,},^,$,|,\). - Readability: for long patterns, use verbose/commented mode (
re.VERBOSEin Python) or split into parts. - Performance: avoid patterns that cause catastrophic backtracking; prefer explicit classes and anchors.
- Internationalization: for accented or Unicode characters, enable Unicode-aware modes or libraries.