š memlog - Awesome Go Library for Data Structures and Algorithms
An easy to use, lightweight, thread-safe and append-only in-memory data structure inspired by Apache Kafka.
Detailed Description of memlog
About
tl;dr
An easy to use, lightweight, thread-safe and append-only in-memory data structure modeled as a Log.
The Log
also serves as an abstraction and building block. See
sharded.Log
for an implementation of a sharded
variant of memlog.Log
.
ā Note: this package is not about providing an in-memory logging
library. To
read more about the ideas behind memlog
please see "The Log: What every
software engineer should know about real-time data's unifying
abstraction".
Motivation
I keep hitting the same user story (use case) over and over again: one or more clients connected to my application wanting to read an immutable stream of data, e.g. events or sensor data, in-order, concurrently (thread-safe) and asynchronously (at their own pace) and in a resource (memory) efficient way.
There's many solutions to this problem, e.g. exposing some sort of streaming API (gRPC, HTTP/REST long-polling) based on custom logic using Go channels or an internal ring buffer, or putting data into an external platform like Kafka, Redis Streams or RabbitMQ Streams.
The challenges I faced with these solutions were that either they were too complex (or simply overkill) for my problem. Or, the system I had to integrate with and read data from did not have a nice streaming API or Go SDK, thus repeating myself writing complex internal caching, buffering and concurrency handling logic for the client APIs.
I looked around and could not find a simple and easy to use Go library for this
problem, so I created memlog
: an easy to use, lightweight (in-memory),
thread-safe, append-only log inspired by popular streaming systems with a
minimal API using Go's standard library primitives š¤©
š” For an end-to-end API modernization example using memlog
see the
vsphere-event-streaming
project, which transforms
a SOAP-based events API into an HTTP/REST streaming API.
Usage
ml, _ := memlog.New(ctx) // create log
offset, _ := ml.Write(ctx, []byte("Hello World")) // write some data
record, _ := ml.Read(ctx, offset) // read back data
fmt.Printf(string(record.Data)) // prints "Hello World"
The memlog
API is intentionally kept minimal. A new Log
is constructed with memlog.New(ctx, options...)
. Data as
[]byte
is written to the log with Log.Write(ctx, data)
.
The first write to the Log
using default Options
starts at position
(Offset
) 0
. Every write creates an immutable Record
in the Log
.
Records
are purged from the Log
when the history segment
is replaced
(see notes below).
The earliest and latest Offset
available in a Log
can be retrieved with
Log.Range(ctx)
.
A specified Record
can be read with Log.Read(ctx, offset)
.
š” Instead of manually polling the Log
for new Records
, the streaming API
Log.Stream(ctx, startOffset)
should be used.
(Not) one Log
to rule them all
One is not constrained by just creating one Log
. For certain use cases,
creating multiple Logs
might be useful. For example:
- Manage completely different data sets/sizes in the same process
- Setting different
Log
sizes (i.e. retention times), e.g. premium users will have access to a larger history ofRecords
- Partitioning input data by type or key
š” For use cases where you want to order the log by key(s)
, consider using the
specialised sharded.Log
.
Full Example
package main
import (
"context"
"fmt"
"os"
"github.com/embano1/memlog"
)
func main() {
ctx := context.Background()
l, err := memlog.New(ctx)
if err != nil {
fmt.Printf("create log: %v", err)
os.Exit(1)
}
offset, err := l.Write(ctx, []byte("Hello World"))
if err != nil {
fmt.Printf("write: %v", err)
os.Exit(1)
}
fmt.Printf("reading record at offset %d\n", offset)
record, err := l.Read(ctx, offset)
if err != nil {
fmt.Printf("read: %v", err)
os.Exit(1)
}
fmt.Printf("data says: %s", record.Data)
// reading record at offset 0
// data says: Hello World
}
Purging the Log
The Log
is divided into an active and history segment
. When the active
segment
is full (configurable via WithMaxSegmentSize()
), it is sealed
(i.e. read-only) and becomes the history segment
. A new empty active
segment
is created for writes. If there is an existing history, it is
replaced, i.e. all Records
are purged from the history.
See pkg.go.dev for the API reference and examples.
A stateless Log? You gotta be kidding!
True, it sounds like an oxymoron. Why would someone use (build) an in-memory append-only log that is not durable?
I'm glad you asked š
This library certainly is not intended to replace messaging, queuing or
streaming systems. It was built for use cases where there exists a durable
data/event source, e.g. a legacy system, REST API, database, etc. that can't
(or should not) be changed. But the requirement being that the (source) data
should be made available over a streaming-like API, e.g. gRPC or processed by
a Go application which requires the properties of a Log
.
memlog
helps as it allows to bridge between these different APIs and use cases
as a building block to extract and store data Records
from an external
system into an in-memory Log
(think ordered cache).
These Records
can then be internally processed (lightweight ETL) or served
asynchronously, in-order (Offset
-based) and concurrently over a modern
streaming API, e.g. gRPC or HTTP/REST (chunked encoding via long polling), to
remote clients.
As another example of such an in-memory log-structured design,
DDlog
follows a similar
approach, where a DDlog
program is used in conjunction with a persistent
database, with database records being fed to DDlog as ground facts.
Checkpointing
Given the data source needs to be durable in this design, one can optionally
build periodic checkpointing logic using the Record
Offset
as the checkpoint
value.
š” When running in Kubernetes,
kvstore
provides a nice
abstraction on top of a ConfigMap
for such requirements.
If the memlog
process crashes, it can then resume from the last checkpointed
Offset
, load the changes since then from the source and resume streaming.
š” This approach is quiet similar to the Kubernetes ListerWatcher()
pattern. See
memlog_test.go
for some inspiration.
Benchmark
I haven't done any extensive benchmarking or code optimization. Feel free to chime in and provide meaningful feedback/optimizations.
One could argue, whether using two slices (active and history data []Record
as part of the individual segments
) is a good engineering choice,
e.g. over using a growable slice as an alternative.
The reason I went for two segments
was that for me dividing the Log
into
multiple segments
with fixed size (and capacity) was easier to reason
about in the code (and I followed my intuition from how log-structured data
platforms do it). I did not inspect the Go compiler optimizations, e.g. it might
actually be smart and create one growable slice under the hood. š¤
These are some results on my MacBook using a log size of 1,000
(records),
i.e. where the Log
history is constantly purged and new segments
(slices)
are created.
go test -v -run=none -bench=. -cpu 1,2,4,8,16 -benchmem
goos: darwin
goarch: amd64
pkg: github.com/embano1/memlog
cpu: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
BenchmarkLog_write
BenchmarkLog_write 11107804 103.0 ns/op 89 B/op 1 allocs/op
BenchmarkLog_write-2 11115896 107.1 ns/op 89 B/op 1 allocs/op
BenchmarkLog_write-4 11419497 105.7 ns/op 89 B/op 1 allocs/op
BenchmarkLog_write-8 10253677 109.6 ns/op 89 B/op 1 allocs/op
BenchmarkLog_write-16 10865994 107.7 ns/op 89 B/op 1 allocs/op
BenchmarkLog_read
BenchmarkLog_read 24461548 49.49 ns/op 32 B/op 1 allocs/op
BenchmarkLog_read-2 25002574 46.63 ns/op 32 B/op 1 allocs/op
BenchmarkLog_read-4 23829378 47.47 ns/op 32 B/op 1 allocs/op
BenchmarkLog_read-8 22936821 47.47 ns/op 32 B/op 1 allocs/op
BenchmarkLog_read-16 24121807 48.25 ns/op 32 B/op 1 allocs/op
PASS
ok github.com/embano1/memlog 12.541s