Regex (regular expressions)

What is a regex? A clear guide to regular expressions with basic syntax, Python and JavaScript examples, common mistakes, and best practices for validating, searching, and extracting text.

« Back to Glossary Index

A compact language to describe text patterns for searching, validation, or extraction.

Definition

A regex (short for regular expression) is a sequence of symbols that defines a pattern of text. It is used to find, validate, replace, or extract fragments of strings in many programming languages (e.g., Python, JavaScript, R, Java, PHP).

Why it’s useful

  • Validation: emails, phone numbers, postal codes.
  • Data cleaning: normalize formats, remove noise.
  • Extraction: capture relevant groups (dates, prices, tags).
  • Mass replacement: code refactors, log rewriting.

Basic syntax

  • . any character (except newline by default)
  • \d digit; \w word character; \s whitespace
  • Quantifiers: * 0+, + 1+, ? 0 or 1, {m,n} range
  • Anchors: ^ start, $ end
  • Classes: [abc] any of these; [^abc] anything except these
  • Groups: (...) capturing; (?:...) non-capturing; (?<name>...) named group
  • Alternation: a|b
  • Flags: e.g. i (case-insensitive), m (multiline), g (global, JS), s (dotall)

Examples

Python

import re

text = "Contact: hello@example.com and sales@shop.co"
pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"

# Find all matches
emails = re.findall(pattern, text)
print(emails)  # ['hello@example.com', 'sales@shop.co']

# Capture named groups (user and domain)
m = re.search(r"(?<user>[A-Za-z0-9._%+-]+)@(?<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})", text)
if m:
    print(m.group("user"), m.group("domain"))

# Substitution
anon = re.sub(pattern, "[email]", text)
print(anon)

JavaScript

const text = "SKU: AB-12345 and AB-98765";
const re = /\bAB-(\d{5})\b/g;

const ids = [...text.matchAll(re)].map(m => m[1]);
console.log(ids); // ["12345","98765"]

// Basic date validation (YYYY-MM)
const isYearMonth = /^\d{4}-(0[1-9]|1[0-2])$/.test("2025-10");
console.log(isYearMonth); // true

Common mistakes and best practices

  • Greediness: .* is greedy; use lazy quantifiers .*? when needed.
  • Escaping: remember to escape meta-characters (., +, ?, (, ), [, ], {, }, ^, $, |, \).
  • Readability: for long patterns, use verbose/commented mode (re.VERBOSE in Python) or split into parts.
  • Performance: avoid patterns that cause catastrophic backtracking; prefer explicit classes and anchors.
  • Internationalization: for accented or Unicode characters, enable Unicode-aware modes or libraries.

See also

« Back to Glossary Index