datatrax - Awesome Go Library for Machine Learning

Data engineering and classic ML toolkit with batch processing, type coercion, and 8 algorithms in pure Go with zero dependencies
Why Datatrax?
Most data engineers use Go for pipelines and Python for everything else. Datatrax eliminates the context switch: type coercion, batch processing, deduplication, and classic ML all in one Go module.
- Zero dependencies - pure Go stdlib, nothing to audit
- Generics-first - built for Go 1.21+, type-safe by default
- Battle-tested utilities - born from real-world ETL pipelines processing 500k+ records/day
- ML without Python - classic algorithms with a scikit-learn-simple API
Install
```shell
go get github.com/rbmuller/datatrax
```
Packages
| Package | Description | Key Functions |
|---|---|---|
| batch | Split slices into chunks for parallel processing | ChunkArray[T] |
| coerce | Convert interface{} to typed values safely | Floatify, Integerify, Boolify, Stringify |
| dateutil | Date/time parsing, conversion, and math | EpochToTimestamp, DaysDifference, StringToDate |
| dedup | Remove duplicates from any comparable slice | Deduplicate[T] |
| errutil | Errors with automatic file:line location | NewError |
| maputil | Map operations - copy, generate from JSON | CopyMap[K,V], GenerateMap |
| mathutil | Safe math operations | Divide (zero-safe) |
| strutil | String utilities and generic search | Contains[T], TrimQuotes, SplitByRegexp |
| ml | ML algorithms - 8 models, metrics, preprocessing | LinearRegression, KNN, KMeans, RandomForest, ... |
Quick Start
Batch Processing
Split large datasets into manageable chunks for parallel processing:
```go
import (
	"sync"

	"github.com/rbmuller/datatrax/batch"
)

records := []int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
chunks := batch.ChunkArray(records, 3)
// [[1 2 3] [4 5 6] [7 8 9] [10]]

// Process chunks in parallel and wait for them to finish
var wg sync.WaitGroup
for _, chunk := range chunks {
	wg.Add(1)
	go func(c []int) {
		defer wg.Done()
		processChunk(c)
	}(chunk)
}
wg.Wait()
```
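Under the hood, a generic chunker is only a few lines. The sketch below shows the shape such a ChunkArray-style helper typically takes; `chunk` is an illustrative name, not the actual datatrax source:

```go
package main

import "fmt"

// chunk splits a slice into pieces of at most size elements,
// reslicing the input rather than copying it.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for size < len(items) {
		out = append(out, items[:size])
		items = items[size:]
	}
	return append(out, items)
}

func main() {
	fmt.Println(chunk([]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 3))
	// [[1 2 3] [4 5 6] [7 8 9] [10]]
}
```

Because the chunks share the original backing array, this is allocation-cheap, which matches the single-allocation figure in the benchmarks below.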
Type Coercion
Safely convert untyped data from JSON, CSV, or database results:
```go
import "github.com/rbmuller/datatrax/coerce"

f, err := coerce.Floatify("3.14")  // 3.14, nil
f, err = coerce.Floatify(42)       // 42.0, nil
i, err := coerce.Integerify("100") // 100, nil
b, err := coerce.Boolify(1)        // true, nil
s, err := coerce.Stringify(3.14)   // "3.14", nil
```
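The pattern behind this kind of safe coercion is a type switch over the dynamic type. The `floatify` below is an illustrative stand-in, not the library's actual implementation, but shows why such conversions never panic:

```go
package main

import (
	"fmt"
	"strconv"
)

// floatify converts common dynamic types to float64,
// returning an error instead of panicking on unknown types.
func floatify(v any) (float64, error) {
	switch x := v.(type) {
	case float64:
		return x, nil
	case float32:
		return float64(x), nil
	case int:
		return float64(x), nil
	case int64:
		return float64(x), nil
	case string:
		return strconv.ParseFloat(x, 64)
	case bool:
		if x {
			return 1, nil
		}
		return 0, nil
	default:
		return 0, fmt.Errorf("cannot coerce %T to float64", v)
	}
}

func main() {
	f, _ := floatify("3.14")
	fmt.Println(f) // 3.14
	f, _ = floatify(42)
	fmt.Println(f) // 42
}
```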
Deduplication
Remove duplicates from any comparable slice - strings, ints, structs:

```go
import "github.com/rbmuller/datatrax/dedup"

names := []string{"Alice", "Bob", "Alice", "Charlie", "Bob"}
uniqueNames := dedup.Deduplicate(names)
// ["Alice", "Bob", "Charlie"]

ids := []int{1, 2, 3, 2, 1, 4}
uniqueIDs := dedup.Deduplicate(ids)
// [1, 2, 3, 4]
```
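The struct case works because Go's `comparable` constraint admits any struct whose fields are all comparable, so values can be used as map keys. A self-contained sketch of the idea (the `dedup` function here is illustrative, not datatrax's source):

```go
package main

import "fmt"

type user struct {
	ID   int
	Name string
}

// dedup removes duplicates from any comparable slice,
// keeping the first occurrence of each value in order.
func dedup[T comparable](items []T) []T {
	seen := make(map[T]struct{}, len(items))
	out := items[:0:0]
	for _, it := range items {
		if _, ok := seen[it]; !ok {
			seen[it] = struct{}{}
			out = append(out, it)
		}
	}
	return out
}

func main() {
	users := []user{{1, "Alice"}, {2, "Bob"}, {1, "Alice"}}
	fmt.Println(dedup(users)) // [{1 Alice} {2 Bob}]
}
```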
Date Utilities
Parse, convert, and calculate date differences:
```go
import "github.com/rbmuller/datatrax/dateutil"

// Convert epoch milliseconds to a readable timestamp
ts, ok := dateutil.EpochToTimestamp(1684624830053)
// "2023-05-21 02:00:30"

// Calculate days between dates (layout uses Go's reference time)
days, err := dateutil.DaysDifference("2024-01-01", "2024-03-15", "2006-01-02")
// 74

// Parse date strings
t, err := dateutil.StringToDate("2024-03-15", "2006-01-02")
```
Error Utilities
Wrap errors with automatic source file and line number:
```go
import (
	"errors"

	"github.com/rbmuller/datatrax/errutil"
)

base := errors.New("connection timeout")
err := errutil.NewError(base)
fmt.Println(err)
// "main.go:42 - connection timeout"

// Supports errors.Is / errors.As via Unwrap()
errors.Is(err, base) // true
```
String Utilities
Generic search, trimming, and formatting:
```go
import "github.com/rbmuller/datatrax/strutil"

// Generic contains - works with any comparable type
strutil.Contains([]string{"a", "b", "c"}, "b") // true
strutil.Contains([]int{1, 2, 3}, 5)            // false

// Trim surrounding quotes
strutil.TrimQuotes(`"hello world"`) // "hello world"

// Join with quotes for SQL
strutil.StringifyWithQuotes([]string{"a", "b"}) // "'a','b'"

// Safe index access - no panics
strutil.SafeIndex([]string{"a", "b"}, 5) // "", false
```
Map Utilities
Copy maps and parse JSON:
```go
import "github.com/rbmuller/datatrax/maputil"

// Generic shallow copy
original := map[string]int{"a": 1, "b": 2}
copied := maputil.CopyMap(original)

// Parse JSON bytes to a map
data := []byte(`{"name": "datatrax", "version": 1}`)
m, err := maputil.GenerateMap(data)
```
Safe Math
Division without panics:
```go
import "github.com/rbmuller/datatrax/mathutil"

mathutil.Divide(10, 3) // 3.333...
mathutil.Divide(10, 0) // 0 (no division-by-zero failure)
```
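For context: float division in Go never panics (it yields ±Inf or NaN), so a zero-safe divide just special-cases the zero denominator to return a clean 0. An illustrative sketch, not the library source:

```go
package main

import "fmt"

// divide returns 0 when the denominator is zero instead of
// propagating +Inf/NaN through a pipeline.
func divide(a, b float64) float64 {
	if b == 0 {
		return 0
	}
	return a / b
}

func main() {
	fmt.Println(divide(10, 0)) // 0
	fmt.Println(divide(10, 2)) // 5
}
```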
Machine Learning
8 ML algorithms with a consistent Fit / Predict API - pure Go, zero dependencies.
| Algorithm | Type | Key Config |
|---|---|---|
| LinearRegression | Regression | LearningRate, Epochs (+ Normal Equation) |
| LogisticRegression | Classification | LearningRate, Epochs, Threshold |
| KNN | Classification | K, Distance (euclidean/manhattan), Weighted |
| KMeans | Clustering | K, MaxIter (K-Means++ init) |
| DecisionTree | Classification | MaxDepth, MinSamples, Criterion (gini/entropy) |
| RandomForest | Classification | NTrees, MaxDepth, MaxFeatures, OOB Score |
| GaussianNB | Classification | none (parameter-free) |
| MultinomialNB | Classification | Alpha (Laplace smoothing) |
Infrastructure: Dataset (CSV loading, train/test split), Preprocessing (MinMaxScale, StandardScale), Encoding (OneHot, Label), Metrics (Accuracy, Precision, Recall, F1, MSE, RMSE, MAE, R², ConfusionMatrix), K-Fold Cross Validation.
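A train/test split like Dataset.Split is conceptually a shuffle followed by a cut at the given ratio. This sketch shows the idea in stdlib Go; the function name and exact shuffling/seeding behavior are assumptions, not datatrax's actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
)

// trainTestSplit shuffles row indices and splits features and
// labels at the given ratio (e.g. 0.8 => 80% train, 20% test).
func trainTestSplit(x [][]float64, y []float64, ratio float64) (xTr, xTe [][]float64, yTr, yTe []float64) {
	idx := rand.Perm(len(x))
	cut := int(float64(len(x)) * ratio)
	for i, j := range idx {
		if i < cut {
			xTr = append(xTr, x[j])
			yTr = append(yTr, y[j])
		} else {
			xTe = append(xTe, x[j])
			yTe = append(yTe, y[j])
		}
	}
	return
}

func main() {
	x := [][]float64{{1}, {2}, {3}, {4}, {5}}
	y := []float64{1, 2, 3, 4, 5}
	xTr, xTe, _, _ := trainTestSplit(x, y, 0.8)
	fmt.Println(len(xTr), len(xTe)) // 4 1
}
```

Shuffling before cutting matters: without it, any ordering in the source CSV (e.g. sorted labels) leaks into the split.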
Benchmarks
All benchmarks on Apple M4, 1000 samples, 10 features:
| Algorithm | Fit | Predict (100 samples) | Allocs |
|---|---|---|---|
| LinearRegression | 828µs | 0.4µs | 1 |
| LogisticRegression | 2.5ms | 1.3µs | 2 |
| KNN | - (stores data) | 10.1ms | 601 |
| KMeans | 1.9ms | - | 223 |
| DecisionTree | 849ms | 1.4µs | 1 |
| GaussianNB | 41µs | 36µs | 102 |

| Utility | Operation | Speed | Allocs |
|---|---|---|---|
| ChunkArray | 10k items, chunks of 100 | 377ns | 1 |
| Deduplicate | 10k strings, 50% dupes | 314µs | 3 |
| Floatify | Single conversion | 27ns | 0 |
| Contains | 10k elements, worst case | 20µs | 0 |
Linear Regression
```go
import "github.com/rbmuller/datatrax/ml"

model := ml.NewLinearRegression()
model.Fit(xTrain, yTrain)
predictions := model.Predict(xTest)
fmt.Println("R²:", ml.R2Score(yTest, predictions))
```
Classification (KNN)
```go
clf := ml.NewKNN(ml.KNNConfig{K: 5, Distance: "euclidean"})
clf.Fit(xTrain, yTrain)
predictions := clf.Predict(xTest)
fmt.Println("Accuracy:", ml.Accuracy(yTest, predictions))
fmt.Println("F1:", ml.F1Score(yTest, predictions, 1.0))
```
Clustering (K-Means)
```go
km := ml.NewKMeans(ml.KMeansConfig{K: 3, MaxIter: 100})
km.Fit(data)
labels := km.Predict(data)
fmt.Println("Inertia:", km.Inertia())
```
Decision Tree
```go
dt := ml.NewDecisionTree(ml.DecisionTreeConfig{
	MaxDepth:   5,
	MinSamples: 2,
	Criterion:  "gini",
})
dt.Fit(xTrain, yTrain)
predictions := dt.Predict(xTest)
fmt.Println("Importance:", dt.FeatureImportance())
```
Random Forest
```go
rf := ml.NewRandomForest(ml.RandomForestConfig{
	NTrees:    100,
	MaxDepth:  10,
	Criterion: "gini",
})
rf.Fit(xTrain, yTrain)
predictions := rf.Predict(xTest)
fmt.Println("Accuracy:", ml.Accuracy(yTest, predictions))
fmt.Println("OOB Score:", rf.OOBScore(xTrain, yTrain))
fmt.Println("Importance:", rf.FeatureImportance())
```
Preprocessing & Evaluation
```go
// Scale features
xScaled := ml.MinMaxScale(xTrain)

// Cross validation
folds := ml.KFoldSplit(x, y, 5)
for _, fold := range folds {
	model.Fit(fold.XTrain, fold.YTrain)
	pred := model.Predict(fold.XTest)
	fmt.Println("Fold R²:", ml.R2Score(fold.YTest, pred))
}

// Full metrics
fmt.Println("Accuracy:", ml.Accuracy(yTrue, yPred))
fmt.Println("Precision:", ml.Precision(yTrue, yPred, 1.0))
fmt.Println("Recall:", ml.Recall(yTrue, yPred, 1.0))
fmt.Println("Confusion:", ml.ConfusionMatrix(yTrue, yPred))
```
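To make the metrics concrete, accuracy is simply the fraction of predictions that match the true labels. A stdlib sketch of the definition behind ml.Accuracy (illustrative, not the library's code):

```go
package main

import "fmt"

// accuracy returns the fraction of positions where the predicted
// label equals the true label; 0 for empty or mismatched inputs.
func accuracy(yTrue, yPred []float64) float64 {
	if len(yTrue) == 0 || len(yTrue) != len(yPred) {
		return 0
	}
	hits := 0
	for i := range yTrue {
		if yTrue[i] == yPred[i] {
			hits++
		}
	}
	return float64(hits) / float64(len(yTrue))
}

func main() {
	fmt.Println(accuracy([]float64{1, 0, 1, 1}, []float64{1, 0, 0, 1})) // 0.75
}
```

Precision, recall, and F1 follow the same shape but count true/false positives per class instead of plain matches.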
Load Dataset from CSV
```go
dataset, err := ml.LoadCSV("data.csv", 4) // target is column index 4
xTrain, xTest, yTrain, yTest := dataset.Split(0.8)
```
Roadmap
| Version | What | Status |
|---|---|---|
| v0.1.0 | Core utilities - 8 packages, 47 tests, zero deps | Done |
| v0.5.0 | Classic ML - 6 algorithms, preprocessing, metrics, cross-validation | Done |
| v1.1.0 | Full ML - 7 algorithms, benchmarks, encoding, tree viz, examples | Done |
| v2.0.0 | SVM, PCA, additional ensemble methods | Planned |
Design Principles
- Zero dependencies - If it can be done with stdlib, it will be
- Generics everywhere - Type safety is not optional
- No silent failures - Functions return (value, error), not zero values
- Pipeline-ready - Every function works with slices and streams
- Documentation-driven - If it's not documented, it doesn't exist
Why Datatrax over existing Go ML libs?
| Library | Status | How Datatrax compares |
|---|---|---|
| goml | Abandoned (2019) | Active, modern Go 1.21+ with generics |
| golearn | Abandoned (2020) | Simpler API, batteries included |
| gorgonia | Active but complex | scikit-learn-simple, not TensorFlow-complex |
| sajari/regression | Regression only | Full toolkit: utilities + ML + preprocessing |
Datatrax is NOT competing with deep learning frameworks. It aims to be the scikit-learn of Go - classic ML with a clean API, plus data engineering utilities that existing Go ML libraries don't bundle.
Contributing
Contributions are welcome! Please:
- Fork the repo
- Create a feature branch (`git checkout -b feat/amazing-feature`)
- Write tests for your changes
- Ensure `go test -race ./...` passes
- Open a PR
License
MIT - Robson Bayer Müller, 2026