coregex - Awesome Go Library for Text Processing

Production regex engine with Rust regex-crate architecture: multi-engine DFA/NFA, SIMD prefilters, drop-in stdlib replacement

๐Ÿท๏ธ Text Processing
๐Ÿ“‚ Regular Expressions
โญ 0 stars
View on GitHub ๐Ÿ”—

Detailed Description of coregex

coregex

High-performance regex engine for Go. Drop-in replacement for regexp with 3-3000x speedup.*

* Typical speedup 15-240x on real-world patterns. 1000x+ achieved on specific edge cases where prefilters skip entire input (e.g., IP pattern on text with no digits).

Why coregex?

Go's stdlib regexp is intentionally simple: a single NFA engine with no optimizations. This guarantees O(n) time but leaves performance on the table.

coregex brings Rust regex-crate architecture to Go:

  • Multi-engine: 17 strategies, including Lazy DFA, PikeVM, OnePass, BoundedBacktracker, and more
  • SIMD prefilters: AVX2/SSSE3 for fast candidate rejection
  • Reverse search: Suffix/inner literal patterns run 1000x+ faster
  • O(n) guarantee: No backtracking, no ReDoS vulnerabilities
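
The prefilter idea can be tried with nothing but the standard library. The sketch below is not coregex's implementation: it uses stdlib regexp and a scalar bytes.IndexAny scan where coregex is described as using SIMD, and the helper name findIP is made up for illustration. It shows why an IP pattern costs almost nothing on input that contains no digits: the regex engine is never started.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// ipPattern has no anchors or word boundaries, so matching on a suffix of the
// input finds the same match as matching on the whole input.
var ipPattern = regexp.MustCompile(`\d+\.\d+\.\d+\.\d+`)

// findIP returns the [start, end) offsets of the first IP-like match, or nil.
// Every match must begin with a digit, so input without digits is rejected
// before the regex engine runs (hypothetical helper, scalar prefilter).
func findIP(b []byte) []int {
	i := bytes.IndexAny(b, "0123456789")
	if i < 0 {
		return nil // prefilter rejected the whole input
	}
	loc := ipPattern.FindIndex(b[i:])
	if loc == nil {
		return nil
	}
	// Translate offsets back to the original input.
	return []int{loc[0] + i, loc[1] + i}
}

func main() {
	fmt.Println(findIP([]byte("no numbers in this line")))
	fmt.Println(findIP([]byte("client 10.0.0.7 connected")))
}
```

The same shortcut would be wrong for patterns with ^ or \b at the front, because slicing the input changes what those assertions see.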

Installation

go get github.com/coregx/coregex

Requires Go 1.25+. Minimal dependencies (golang.org/x/sys, github.com/coregx/ahocorasick).

Quick Start

package main

import (
    "fmt"
    "github.com/coregx/coregex"
)

func main() {
    re := coregex.MustCompile(`\w+@\w+\.\w+`)

    // The example.com addresses are placeholders.
    text := []byte("Contact support@example.com for help")

    // Find first match
    fmt.Printf("Found: %s\n", re.Find(text))

    // Check if matches (zero allocation)
    if re.MatchString("user@example.com") {
        fmt.Println("Valid email format")
    }
}

Performance

Cross-language benchmarks on 6MB input, AMD EPYC (source):

Pattern                   Go stdlib   coregex   Rust regex   vs stdlib   vs Rust
Literal alternation       554 ms      4.5 ms    0.72 ms      122x        6.2x slower
Multi-literal             1572 ms     12.4 ms   5.5 ms       126x        2.2x slower
Inner .*keyword.*         238 ms      0.27 ms   0.33 ms      881x        1.2x faster
Suffix .*\.txt            239 ms      1.9 ms    1.2 ms       125x        1.5x slower
Multiline (?m)^/.*\.php   102 ms      0.34 ms   0.75 ms      299x        2.2x faster
Email validation          257 ms      0.46 ms   0.31 ms      557x        1.4x slower
URL extraction            256 ms      0.62 ms   0.37 ms      413x        1.6x slower
IP address                494 ms      0.72 ms   13.5 ms      685x        18.8x faster
Version \d+.\d+.\d+       164 ms      0.62 ms   0.79 ms      263x        1.2x faster
Char class [\w]+          478 ms      42.1 ms   56.4 ms      11x         1.3x faster
Word repeat (\w{2,8})+    690 ms      180 ms    54.7 ms      3x          3.2x slower
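
To get a rough stdlib baseline on your own hardware, a small timing loop is enough. The sketch below is an assumption-laden stand-in for the benchmark suite, not a reproduction of it: the input is a synthetic 1 MB log rather than the 6MB corpus, only two of the patterns are included, and timeFindAll is a made-up helper. A single wall-clock run is noisy, so treat the numbers as indicative only.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"time"
)

// benchPatterns holds two patterns from the table, compiled with stdlib regexp.
var benchPatterns = []struct {
	name string
	re   *regexp.Regexp
}{
	{"IP address", regexp.MustCompile(`\d+\.\d+\.\d+\.\d+`)},
	{"Char class", regexp.MustCompile(`[\w]+`)},
}

// timeFindAll runs FindAllIndex once and reports the match count and the
// elapsed wall-clock time (hypothetical helper, single run).
func timeFindAll(re *regexp.Regexp, input []byte) (int, time.Duration) {
	start := time.Now()
	n := len(re.FindAllIndex(input, -1))
	return n, time.Since(start)
}

func main() {
	// Synthetic input of roughly 1 MB; adjust the size to taste.
	line := []byte("GET /index.php?id=42 from 192.168.1.10 agent=curl\n")
	input := bytes.Repeat(line, (1<<20)/len(line))

	for _, p := range benchPatterns {
		n, d := timeFindAll(p.re, input)
		fmt.Printf("%-12s %8d matches in %v\n", p.name, n, d)
	}
}
```

For a real comparison, move the loop body into a testing.B benchmark and run both engines on identical input.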

Where coregex excels:

  • Multiline patterns ((?m)^/.*\.php): 2.2x faster than Rust, 299x vs stdlib
  • IP/phone patterns (\d+\.\d+\.\d+\.\d+): SIMD digit prefilter skips non-digit regions
  • Suffix patterns (.*\.log, .*\.txt): reverse search optimization (1000x+)
  • Inner literals (.*error.*, .*@example\.com): bidirectional DFA (900x+)
  • Multi-pattern (foo|bar|baz|...): Slim Teddy (≤32), Fat Teddy (33-64), or Aho-Corasick (>64)
  • Anchored alternations (^(\d+|UUID|hex32)): O(1) branch dispatch (5-20x)
  • Concatenated char classes ([a-zA-Z]+[0-9]+): DFA with byte classes (5-7x)
  • Zero-alloc iterators (AllIndex, AppendAllIndex): 0 heap allocs, up to 30% faster than FindAll. Email pattern faster than Rust with AppendAllIndex.
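
The reverse-search idea for suffix patterns can be illustrated by hand for one fixed pattern, .*\.txt. The sketch below is a simplified illustration and not coregex's code: findSuffixTxt and sameAsStdlib are hypothetical helpers, and a general engine would run a reverse DFA instead of hard-coding the "no newline" rule for the dot. The point is that a fast literal search for the suffix locates the candidate, and only a short backwards scan is needed to find where the match starts.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

var suffixRe = regexp.MustCompile(`.*\.txt`)

// findSuffixTxt returns the offsets stdlib reports for `.*\.txt`, computed
// without a regex engine: literal search forward, then a reverse scan.
func findSuffixTxt(b []byte) []int {
	lit := []byte(".txt")
	i := bytes.Index(b, lit)
	if i < 0 {
		return nil // the suffix literal never occurs, so nothing can match
	}
	// Reverse scan: `.` does not match '\n', so the leftmost match starts
	// at the beginning of the line that holds the first ".txt".
	start := bytes.LastIndexByte(b[:i], '\n') + 1
	// `.*` is greedy, so the match ends at the last ".txt" on that line.
	lineEnd := len(b)
	if j := bytes.IndexByte(b[i:], '\n'); j >= 0 {
		lineEnd = i + j
	}
	end := start + bytes.LastIndex(b[start:lineEnd], lit) + len(lit)
	return []int{start, end}
}

// sameAsStdlib cross-checks the hand-written search against stdlib regexp.
func sameAsStdlib(b []byte) bool {
	got, want := findSuffixTxt(b), suffixRe.FindIndex(b)
	if len(got) != len(want) {
		return false
	}
	for k := range got {
		if got[k] != want[k] {
			return false
		}
	}
	return true
}

func main() {
	in := []byte("a\nfoo.txt bar.txt\nz.txt")
	fmt.Println(findSuffixTxt(in), sameAsStdlib(in))
}
```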

Features

Engine Selection

coregex automatically selects the optimal engine:

Strategy                 Pattern Type                   Speedup
AnchoredLiteral          ^prefix.*suffix$               32-133x
MultilineReverseSuffix   (?m)^/.*\.php                  100-552x ⚡
ReverseInner             .*keyword.*                    100-900x
ReverseSuffix            .*\.txt                        100-1100x
BranchDispatch           ^(\d+|UUID|hex32)              5-20x
CompositeSequenceDFA     [a-zA-Z]+[0-9]+                5-7x
LazyDFA                  IP, complex patterns           10-150x
AhoCorasick              a|b|c|...|z (>64 patterns)     75-113x
CharClassSearcher        [\w]+, \d+                     4-25x
Slim Teddy               foo|bar|baz (2-32 patterns)    15-240x
Fat Teddy                33-64 patterns                 60-73x
OnePass                  Anchored captures              10x
BoundedBacktracker       Small patterns                 2-5x
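
To make the multi-pattern rows concrete, the sketch below classifies a pure literal alternation by its literal count. Only the thresholds (2-32, 33-64, more than 64) come from this README; the functions literalAlternatives and pickMultiLiteral are hypothetical and say nothing about how coregex actually analyses patterns.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// literalAlternatives splits a pattern such as "foo|bar|baz" into literals.
// It reports ok=false if any branch is empty or contains regex
// metacharacters (detected by comparing against regexp.QuoteMeta).
func literalAlternatives(pattern string) (lits []string, ok bool) {
	for _, branch := range strings.Split(pattern, "|") {
		if branch == "" || regexp.QuoteMeta(branch) != branch {
			return nil, false
		}
		lits = append(lits, branch)
	}
	return lits, true
}

// pickMultiLiteral maps a literal count to the searcher named in the table.
// Hypothetical illustration of the documented thresholds.
func pickMultiLiteral(n int) string {
	switch {
	case n < 2:
		return "single literal"
	case n <= 32:
		return "Slim Teddy"
	case n <= 64:
		return "Fat Teddy"
	default:
		return "Aho-Corasick"
	}
}

func main() {
	for _, p := range []string{"foo|bar|baz", `\d+|UUID`} {
		if lits, ok := literalAlternatives(p); ok {
			fmt.Printf("%q: %d literals, %s\n", p, len(lits), pickMultiLiteral(len(lits)))
		} else {
			fmt.Printf("%q: not a pure literal alternation\n", p)
		}
	}
}
```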

API Compatibility

Drop-in replacement for regexp.Regexp:

// stdlib
re := regexp.MustCompile(pattern)

// coregex: same API
re := coregex.MustCompile(pattern)

Supported methods:

  • Match, MatchString, MatchReader
  • Find, FindString, FindAll, FindAllString
  • FindIndex, FindStringIndex, FindAllIndex
  • FindSubmatch, FindStringSubmatch, FindAllSubmatch
  • ReplaceAll, ReplaceAllString, ReplaceAllFunc
  • Split, SubexpNames, NumSubexp
  • Longest, Copy, String
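
One low-risk way to trial a replacement engine is to code against a small interface and swap the constructor in one place. The sketch below uses only stdlib regexp so it runs anywhere. The Matcher interface and redactDigits are hypothetical names, and the claim that a coregex *Regexp would satisfy the interface rests on the method list above having stdlib-identical signatures, which is an assumption here rather than something this example verifies.

```go
package main

import (
	"fmt"
	"regexp"
)

// Matcher lists the subset of methods this program needs.
// *regexp.Regexp satisfies it; another engine with the same method
// signatures could be dropped in without touching the callers.
type Matcher interface {
	MatchString(s string) bool
	FindAllString(s string, n int) []string
	ReplaceAllString(src, repl string) string
}

// digits is the single place where the engine is chosen.
var digits Matcher = regexp.MustCompile(`\d+`)

// redactDigits depends only on the interface, not on a concrete engine.
func redactDigits(m Matcher, s string) string {
	return m.ReplaceAllString(s, "#")
}

func main() {
	fmt.Println(redactDigits(digits, "order 66 shipped in 3 days"))
	fmt.Println(digits.FindAllString("1 22 333", -1))
}
```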

Zero-Allocation APIs

// Zero allocations: boolean match
matched := re.IsMatch(text)

// Zero allocations: single match indices
start, end, found := re.FindIndices(text)

// Zero allocations: iterator over all matches (Go 1.23+)
for m := range re.AllIndex(data) {
    fmt.Printf("match at [%d, %d]\n", m[0], m[1])
}

// Zero allocations: match content iterator
for s := range re.AllString(text) {
    fmt.Println(s)
}

// Buffer reuse: append to caller's slice (strconv.Append* pattern)
var buf [][2]int
for _, chunk := range chunks {
    buf = re.AppendAllIndex(buf[:0], chunk, -1)
    process(buf)
}
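
The buffer-reuse calling convention can be tried against stdlib with a small stand-in. appendAllIndex below is a hypothetical helper built on FindAllIndex, with a made-up parameter order; it is not coregex's AppendAllIndex and it still allocates inside FindAllIndex. What it does show is the caller's side of the pattern: passing buf[:0] keeps the backing array, so the result slice stops growing once it is large enough for the biggest chunk.

```go
package main

import (
	"fmt"
	"regexp"
)

var wordRe = regexp.MustCompile(`\w+`)

// appendAllIndex appends the bounds of up to n matches (all if n < 0) to dst
// and returns the extended slice, in the style of strconv.AppendInt.
// Stand-in for illustration only: stdlib allocates its own result internally.
func appendAllIndex(re *regexp.Regexp, dst [][2]int, b []byte, n int) [][2]int {
	for _, loc := range re.FindAllIndex(b, n) {
		dst = append(dst, [2]int{loc[0], loc[1]})
	}
	return dst
}

func main() {
	chunks := [][]byte{
		[]byte("delta epsilon zeta"),
		[]byte("alpha beta"),
		[]byte("gamma"),
	}
	var buf [][2]int
	for _, chunk := range chunks {
		// buf[:0] resets the length but keeps the capacity for reuse.
		buf = appendAllIndex(wordRe, buf[:0], chunk, -1)
		fmt.Println(len(buf), "matches, capacity", cap(buf))
	}
}
```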

Configuration

config := coregex.DefaultConfig()
config.DFAMaxStates = 10000      // Limit DFA cache
config.EnablePrefilter = true    // SIMD acceleration

re, err := coregex.CompileWithConfig(pattern, config)

Thread Safety

A compiled *Regexp is safe for concurrent use by multiple goroutines:

re := coregex.MustCompile(`\d+`)

// Safe: multiple goroutines sharing one compiled pattern
var wg sync.WaitGroup
for i := 0; i < 100; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        re.FindString("test 123 data")  // thread-safe
    }()
}
wg.Wait()

Internally uses sync.Pool (same pattern as Go stdlib regexp) for per-search state management.
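
The sync.Pool pattern mentioned above looks roughly like this. The sketch is generic and not taken from coregex: byteSearcher and searchState are hypothetical types, and the "search" is a trivial byte count. The shape is what matters: the compiled object is immutable and shared, while each call borrows private scratch memory from the pool and returns it afterwards.

```go
package main

import (
	"fmt"
	"sync"
)

// searchState stands in for per-search scratch memory (hypothetical).
type searchState struct {
	positions []int
}

// byteSearcher is immutable after construction except for the pool,
// which is itself safe for concurrent use.
type byteSearcher struct {
	needle byte
	pool   sync.Pool
}

func newByteSearcher(needle byte) *byteSearcher {
	s := &byteSearcher{needle: needle}
	s.pool.New = func() any { return &searchState{} }
	return s
}

// Count borrows a private state from the pool, so concurrent calls never
// share scratch memory.
func (s *byteSearcher) Count(input string) int {
	st := s.pool.Get().(*searchState)
	defer s.pool.Put(st)
	st.positions = st.positions[:0] // reset state left by a previous search
	for i := 0; i < len(input); i++ {
		if input[i] == s.needle {
			st.positions = append(st.positions, i)
		}
	}
	return len(st.positions)
}

// countConcurrently runs Count from several goroutines on one searcher.
func countConcurrently(s *byteSearcher, input string, workers int) []int {
	results := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			results[w] = s.Count(input)
		}(w)
	}
	wg.Wait()
	return results
}

func main() {
	fmt.Println(countConcurrently(newByteSearcher('a'), "banana", 4))
}
```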

Syntax Support

Uses Go's regexp/syntax parser:

Feature             Support
Character classes   [a-z], \d, \w, \s
Quantifiers         *, +, ?, {n,m}
Anchors             ^, $, \b, \B
Groups              (...), (?:...), (?P<name>...)
Unicode             \p{L}, \P{N}
Flags               (?i), (?m), (?s)
Backreferences      Not supported (O(n) guarantee)
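
Because parsing is delegated to regexp/syntax, you can check whether a pattern is in the supported dialect without any third-party code. The sketch below calls the stdlib parser directly with the Perl flag set, which is what stdlib regexp.Compile uses; whether coregex passes exactly the same flags is an assumption. The helper name parses is made up.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// parses reports whether Go's regexp/syntax accepts the pattern.
func parses(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.Perl)
	return err == nil
}

func main() {
	patterns := []string{
		`(?P<year>\d{4})-\d{2}`, // named group
		`\p{L}+`,                // Unicode class
		`(?i)error\b`,           // flag and word boundary
		`(\w+) \1`,              // backreference: rejected by the parser
	}
	for _, p := range patterns {
		fmt.Printf("%-24q parses: %v\n", p, parses(p))
	}
}
```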

Architecture

Pattern → Parse → NFA → Literal Extract → Strategy Select
                                               ↓
                  ┌────────────────────────────────────────────┐
                  │ Engines (17 strategies):                   │
                  │  LazyDFA, PikeVM, OnePass,                 │
                  │  BoundedBacktracker, ReverseAnchored,      │
                  │  ReverseInner, ReverseSuffix,              │
                  │  ReverseSuffixSet, MultilineReverseSuffix, │
                  │  AnchoredLiteral, CharClassSearcher,       │
                  │  Teddy, DigitPrefilter, AhoCorasick,       │
                  │  CompositeSearcher, BranchDispatch, Both   │
                  └────────────────────────────────────────────┘
                                               ↓
Input → Prefilter (SIMD) → Engine → Match Result

For detailed architecture documentation, see docs/ARCHITECTURE.md. For optimization details, see docs/OPTIMIZATIONS.md.
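
A toy version of the compile-time and search-time split can be built from stdlib pieces. In the sketch below, the literal-extraction stage is played by (*regexp.Regexp).LiteralPrefix and the prefilter by bytes.Contains. The prefiltered type and compile function are hypothetical, there is only one "engine", and stdlib already applies a similar prefix optimisation internally, so this shows the shape of the pipeline rather than a speedup.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// prefiltered pairs a compiled pattern with a literal every match must
// start with (empty if none could be extracted).
type prefiltered struct {
	re     *regexp.Regexp
	prefix []byte
}

// compile covers the first row of the diagram: parse, build the program,
// extract a literal. Strategy selection is omitted.
func compile(pattern string) (*prefiltered, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	lit, _ := re.LiteralPrefix()
	return &prefiltered{re: re, prefix: []byte(lit)}, nil
}

// Match covers the last row: prefilter first, engine only if it passes.
func (p *prefiltered) Match(input []byte) bool {
	if len(p.prefix) > 0 && !bytes.Contains(input, p.prefix) {
		return false // required literal absent, so no match is possible
	}
	return p.re.Match(input)
}

func main() {
	p, err := compile(`error: \w+`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("literal prefix: %q\n", p.prefix)
	fmt.Println(p.Match([]byte("info: all good")))
	fmt.Println(p.Match([]byte("fatal error: disk full")))
}
```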

SIMD Primitives (AMD64):

  • memchr: single byte search (AVX2)
  • memmem: substring search (SSSE3)
  • Slim Teddy: multi-pattern search, 2-32 patterns (SSSE3, 9+ GB/s)
  • Fat Teddy: multi-pattern search, 33-64 patterns (AVX2, 9+ GB/s)

Pure Go fallback on other architectures.

Battle-Tested

coregex was tested in GoAWK. This real-world testing uncovered 15+ edge cases that synthetic benchmarks missed.

Powered by coregex: uawk

uawk is a modern AWK interpreter built on coregex:

Benchmark (10MB)    GoAWK   uawk    Speedup
Regex alternation   1.85s   97ms    19x
IP matching         290ms   99ms    2.9x
General regex       320ms   100ms   3.2x

go install github.com/kolkov/uawk/cmd/uawk@latest
uawk '/error/ { print $0 }' server.log

We need more testers! If you have a project using regexp, try coregex and report issues.

Documentation

Comparison

                 coregex          stdlib     regexp2
Performance      3-3000x faster   Baseline   Slower
SIMD             AVX2/SSSE3       No         No
O(n) guarantee   Yes              Yes        No
Backreferences   No               No         Yes
API              Drop-in          n/a        Different

Use coregex for performance-critical code with O(n) guarantee. Use stdlib for simple cases where performance doesn't matter. Use regexp2 if you need backreferences (accept exponential worst-case).

Related

Inspired by:

License

MIT (see LICENSE).


Status: Pre-1.0 (API may change). Ready for testing and feedback.
