
Datatrax

Data engineering and machine learning toolkit for Go.

Batch processing, type coercion, deduplication, date utilities, and classic ML algorithms β€” all in pure Go with zero external dependencies.



Why Datatrax?

Most data engineers use Go for pipelines and Python for everything else. Datatrax eliminates the context switch β€” type coercion, batch processing, deduplication, and classic ML all in one Go module.

  • Zero dependencies β€” pure Go stdlib, nothing to audit
  • Generics-first β€” built for Go 1.21+, type-safe by default
  • Battle-tested utilities β€” born from real-world ETL pipelines processing 500k+ records/day
  • ML without Python β€” classic algorithms with a scikit-learn-simple API

Install

go get github.com/rbmuller/datatrax

Packages

| Package | Description | Key Functions |
|---------|-------------|---------------|
| batch | Split slices into chunks for parallel processing | ChunkArray[T] |
| coerce | Convert interface{} to typed values safely | Floatify, Integerify, Boolify, Stringify |
| dateutil | Date/time parsing, conversion, and math | EpochToTimestamp, DaysDifference, StringToDate |
| dedup | Remove duplicates from any comparable slice | Deduplicate[T] |
| errutil | Errors with automatic file:line location | NewError |
| maputil | Map operations β€” copy, generate from JSON | CopyMap[K,V], GenerateMap |
| mathutil | Safe math operations | Divide (zero-safe) |
| strutil | String utilities and generic search | Contains[T], TrimQuotes, SplitByRegexp |
| ml | ML algorithms β€” 8 models, metrics, preprocessing | LinearRegression, KNN, KMeans, RandomForest, ... |

Quick Start

Batch Processing

Split large datasets into manageable chunks for parallel processing:

import (
    "sync"

    "github.com/rbmuller/datatrax/batch"
)

records := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
chunks := batch.ChunkArray(records, 3)
// [[1 2 3] [4 5 6] [7 8 9] [10]]

// Process chunks in parallel, waiting for all to finish
var wg sync.WaitGroup
for _, chunk := range chunks {
    wg.Add(1)
    go func(c []int) {
        defer wg.Done()
        processChunk(c)
    }(chunk)
}
wg.Wait()

Type Coercion

Safely convert untyped data from JSON, CSV, or database results:

import "github.com/rbmuller/datatrax/coerce"

f, err := coerce.Floatify("3.14")   // 3.14, nil
f, err = coerce.Floatify(42)        // 42.0, nil
n, err := coerce.Integerify("100")  // 100, nil
b, err := coerce.Boolify(1)         // true, nil
s, err := coerce.Stringify(3.14)    // "3.14", nil

Deduplication

Remove duplicates from any comparable slice β€” strings, ints, structs:

import "github.com/rbmuller/datatrax/dedup"

names := []string{"Alice", "Bob", "Alice", "Charlie", "Bob"}
uniqueNames := dedup.Deduplicate(names)
// ["Alice", "Bob", "Charlie"]

ids := []int{1, 2, 3, 2, 1, 4}
uniqueIDs := dedup.Deduplicate(ids)
// [1, 2, 3, 4]

Date Utilities

Parse, convert, and calculate date differences:

import "github.com/rbmuller/datatrax/dateutil"

// Convert epoch milliseconds to readable timestamp
ts, ok := dateutil.EpochToTimestamp(1684624830053)
// "2023-05-21 02:00:30"

// Calculate days between dates
days, err := dateutil.DaysDifference("2024-01-01", "2024-03-15", "2006-01-02")
// 74

// Parse date strings
t, err := dateutil.StringToDate("2024-03-15", "2006-01-02")

Error Utilities

Wrap errors with automatic source file and line number:

import "github.com/rbmuller/datatrax/errutil"

originalErr := errors.New("connection timeout")
err := errutil.NewError(originalErr)
fmt.Println(err)
// "main.go:42 - connection timeout"

// Supports errors.Is / errors.As via Unwrap()
errors.Is(err, originalErr) // true

String Utilities

Generic search, trimming, and formatting:

import "github.com/rbmuller/datatrax/strutil"

// Generic contains β€” works with any comparable type
strutil.Contains([]string{"a", "b", "c"}, "b")  // true
strutil.Contains([]int{1, 2, 3}, 5)              // false

// Trim surrounding quotes
strutil.TrimQuotes(`"hello world"`)  // "hello world"

// Join with quotes for SQL
strutil.StringifyWithQuotes([]string{"a", "b"})  // "'a','b'"

// Safe index access β€” no panics
strutil.SafeIndex([]string{"a", "b"}, 5)  // "", false

Map Utilities

Copy maps and parse JSON:

import "github.com/rbmuller/datatrax/maputil"

// Generic shallow copy
original := map[string]int{"a": 1, "b": 2}
copied := maputil.CopyMap(original)

// Parse JSON bytes to map
data := []byte(`{"name": "datatrax", "version": 1}`)
m, err := maputil.GenerateMap(data)

Safe Math

Division without panics:

import "github.com/rbmuller/datatrax/mathutil"

mathutil.Divide(10, 3)  // 3.333...
mathutil.Divide(10, 0)  // 0 (no panic)
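Zero-safe division is a one-line guard; a sketch of the behavior shown above (illustrative `divide`, not datatrax's source):

```go
package main

import "fmt"

// divide returns a/b, or 0 when b is zero, instead of panicking
// (as integer division does) or returning +Inf (as float64 does).
func divide(a, b float64) float64 {
	if b == 0 {
		return 0
	}
	return a / b
}

func main() {
	fmt.Println(divide(10, 3)) // 3.3333333333333335
	fmt.Println(divide(10, 0)) // 0
}
```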

Machine Learning

8 ML algorithms with a consistent Fit / Predict API β€” pure Go, zero dependencies.

| Algorithm | Type | Key Config |
|-----------|------|------------|
| LinearRegression | Regression | LearningRate, Epochs (+ Normal Equation) |
| LogisticRegression | Classification | LearningRate, Epochs, Threshold |
| KNN | Classification | K, Distance (euclidean/manhattan), Weighted |
| KMeans | Clustering | K, MaxIter (K-Means++ init) |
| DecisionTree | Classification | MaxDepth, MinSamples, Criterion (gini/entropy) |
| RandomForest | Classification | NTrees, MaxDepth, MaxFeatures, OOB Score |
| GaussianNB | Classification | β€” (parameter-free) |
| MultinomialNB | Classification | Alpha (Laplace smoothing) |

Infrastructure: Dataset (CSV loading, train/test split), Preprocessing (MinMaxScale, StandardScale), Encoding (OneHot, Label), Metrics (Accuracy, Precision, Recall, F1, MSE, RMSE, MAE, RΒ², ConfusionMatrix), K-Fold Cross Validation.

Benchmarks

All benchmarks on Apple M4, 1000 samples, 10 features:

| Algorithm | Fit | Predict (100 samples) | Allocs |
|-----------|-----|-----------------------|--------|
| LinearRegression | 828Β΅s | 0.4Β΅s | 1 |
| LogisticRegression | 2.5ms | 1.3Β΅s | 2 |
| KNN | β€” (stores data) | 10.1ms | 601 |
| KMeans | 1.9ms | β€” | 223 |
| DecisionTree | 849ms | 1.4Β΅s | 1 |
| GaussianNB | 41Β΅s | 36Β΅s | 102 |

| Utility | Operation | Speed | Allocs |
|---------|-----------|-------|--------|
| ChunkArray | 10k items, chunks of 100 | 377ns | 1 |
| Deduplicate | 10k strings, 50% dupes | 314Β΅s | 3 |
| Floatify | Single conversion | 27ns | 0 |
| Contains | 10k elements, worst case | 20Β΅s | 0 |

Linear Regression

import "github.com/rbmuller/datatrax/ml"

model := ml.NewLinearRegression()
model.Fit(xTrain, yTrain)
predictions := model.Predict(xTest)
fmt.Println("RΒ²:", ml.R2Score(yTest, predictions))

Classification (KNN)

clf := ml.NewKNN(ml.KNNConfig{K: 5, Distance: "euclidean"})
clf.Fit(xTrain, yTrain)
predictions := clf.Predict(xTest)
fmt.Println("Accuracy:", ml.Accuracy(yTest, predictions))
fmt.Println("F1:", ml.F1Score(yTest, predictions, 1.0))
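Under the hood, k-NN classification is just a distance computation plus a majority vote among the k closest training points. A self-contained sketch of that idea (not datatrax's implementation):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// knnPredict classifies point p by the majority label among its
// k nearest training points, using Euclidean distance.
func knnPredict(xTrain [][]float64, yTrain []int, p []float64, k int) int {
	type neighbor struct {
		dist  float64
		label int
	}
	ns := make([]neighbor, len(xTrain))
	for i, row := range xTrain {
		var s float64
		for j := range row {
			d := row[j] - p[j]
			s += d * d
		}
		ns[i] = neighbor{math.Sqrt(s), yTrain[i]}
	}
	sort.Slice(ns, func(a, b int) bool { return ns[a].dist < ns[b].dist })

	// Majority vote over the k nearest neighbors.
	votes := map[int]int{}
	best, bestCount := 0, -1
	for _, n := range ns[:k] {
		votes[n.label]++
		if votes[n.label] > bestCount {
			best, bestCount = n.label, votes[n.label]
		}
	}
	return best
}

func main() {
	x := [][]float64{{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}}
	y := []int{0, 0, 0, 1, 1, 1}
	fmt.Println(knnPredict(x, y, []float64{0.5, 0.5}, 3)) // 0
	fmt.Println(knnPredict(x, y, []float64{5.5, 5.5}, 3)) // 1
}
```

Note that "Fit" for KNN is just storing the data, which is why the benchmark table shows no fit time for it.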

Clustering (K-Means)

km := ml.NewKMeans(ml.KMeansConfig{K: 3, MaxIter: 100})
km.Fit(data)
labels := km.Predict(data)
fmt.Println("Inertia:", km.Inertia())

Decision Tree

dt := ml.NewDecisionTree(ml.DecisionTreeConfig{
    MaxDepth:   5,
    MinSamples: 2,
    Criterion:  "gini",
})
dt.Fit(xTrain, yTrain)
predictions := dt.Predict(xTest)
fmt.Println("Importance:", dt.FeatureImportance())

Random Forest

rf := ml.NewRandomForest(ml.RandomForestConfig{
    NTrees:    100,
    MaxDepth:  10,
    Criterion: "gini",
})
rf.Fit(xTrain, yTrain)
predictions := rf.Predict(xTest)
fmt.Println("Accuracy:", ml.Accuracy(yTest, predictions))
fmt.Println("OOB Score:", rf.OOBScore(xTrain, yTrain))
fmt.Println("Importance:", rf.FeatureImportance())

Preprocessing & Evaluation

// Scale features
xScaled := ml.MinMaxScale(xTrain)

// Cross validation
folds := ml.KFoldSplit(x, y, 5)
for _, fold := range folds {
    model.Fit(fold.XTrain, fold.YTrain)
    pred := model.Predict(fold.XTest)
    fmt.Println("Fold RΒ²:", ml.R2Score(fold.YTest, pred))
}

// Full metrics
fmt.Println("Accuracy:", ml.Accuracy(yTrue, yPred))
fmt.Println("Precision:", ml.Precision(yTrue, yPred, 1.0))
fmt.Println("Recall:", ml.Recall(yTrue, yPred, 1.0))
fmt.Println("Confusion:", ml.ConfusionMatrix(yTrue, yPred))
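Accuracy itself is simply the fraction of predictions that match the true labels; a minimal stdlib sketch of the metric (illustrative `accuracy` helper, not the library's code):

```go
package main

import "fmt"

// accuracy returns the fraction of predictions equal to the truth.
func accuracy(yTrue, yPred []int) float64 {
	correct := 0
	for i := range yTrue {
		if yTrue[i] == yPred[i] {
			correct++
		}
	}
	return float64(correct) / float64(len(yTrue))
}

func main() {
	fmt.Println(accuracy([]int{1, 0, 1, 1}, []int{1, 0, 0, 1})) // 0.75
}
```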

Load Dataset from CSV

dataset, err := ml.LoadCSV("data.csv", 4) // target is column 4
xTrain, xTest, yTrain, yTest := dataset.Split(0.8)

Roadmap

| Version | What | Status |
|---------|------|--------|
| v0.1.0 | Core utilities β€” 8 packages, 47 tests, zero deps | Done |
| v0.5.0 | Classic ML β€” 6 algorithms, preprocessing, metrics, cross-validation | Done |
| v1.1.0 | Full ML β€” 7 algorithms, benchmarks, encoding, tree viz, examples | Done |
| v2.0.0 | Random Forest, SVM, PCA, ensemble methods | Planned |

Design Principles

  1. Zero dependencies β€” If it can be done with stdlib, it will be
  2. Generics everywhere β€” Type safety is not optional
  3. No silent failures β€” Functions return (value, error), not zero values
  4. Pipeline-ready β€” Every function works with slices and streams
  5. Documentation-driven β€” If it's not documented, it doesn't exist

Why Datatrax over existing Go ML libs?

| Library | Status | How Datatrax compares |
|---------|--------|-----------------------|
| goml | Abandoned (2019) | Active, modern Go 1.21+ with generics |
| golearn | Abandoned (2020) | Simpler API, batteries included |
| gorgonia | Active but complex | scikit-learn-simple, not TensorFlow-complex |
| sajari/regression | Regression only | Full toolkit: utilities + ML + preprocessing |

Datatrax is NOT competing with deep learning frameworks. It's the scikit-learn of Go β€” classic ML with a clean API, plus data engineering utilities that no other Go ML lib offers.

Contributing

Contributions are welcome! Please:

  1. Fork the repo
  2. Create a feature branch (git checkout -b feat/amazing-feature)
  3. Write tests for your changes
  4. Ensure go test -race ./... passes
  5. Open a PR

License

MIT β€” Robson Bayer MΓΌller, 2026