kagome - Awesome Go Library for Natural Language Processing

JP morphological analyzer written in pure Go
Detailed Description of kagome
Kagome v2
Kagome is an open source Japanese morphological analyzer written in pure Go. It can tokenize Japanese text into words and analyze parts of speech, with dictionaries embedded in the binary for easy deployment.
[!NOTE] Key features (Improvements from v1):
- Self-contained binaries with embedded dictionaries (MeCab-IPADIC, UniDic)
- Multiple segmentation modes for different use cases
- RESTful API server mode for production use
- WebAssembly support for browser environments
- C library API for FFI integration (Python, PHP, and other languages)
Index
- Basic Usage
- Install
- Commands
- Dictionaries
- Segmentation modes
- Docker
- WebAssembly
- Use from other languages (FFI)
- Reference
- License
Basic Usage
Command line
% kagome -h
Japanese Morphological Analyzer -- github.com/ikawaha/kagome/v2
usage: kagome <command>
The commands are:
   [tokenize] - command line tokenize (*default)
   server - run tokenize server
   lattice - lattice viewer
   sentence - tiny sentence splitter
   version - show version

tokenize [-file input_file] [-dict dic_file] [-userdict user_dic_file] [-sysdict (ipa|uni)] [-simple false] [-mode (normal|search|extended)] [-split] [-json]
  -dict string
        dict
  -file string
        input file
  -json
        outputs in JSON format
  -mode string
        tokenize mode (normal|search|extended) (default "normal")
  -simple
        display abbreviated dictionary contents
  -split
        use tiny sentence splitter
  -sysdict string
        system dict type (ipa|uni) (default "ipa")
  -udict string
        user dict
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
- For more details, see the Commands section.
As a Go library
You can integrate Kagome into your Go applications as follows:
# Install Kagome module
go get github.com/ikawaha/kagome/v2
package main

import (
	"fmt"
	"strings"

	"github.com/ikawaha/kagome-dict/ipa"
	"github.com/ikawaha/kagome/v2/tokenizer"
)

func main() {
	// Build a tokenizer on the IPA dictionary, omitting the BOS/EOS tokens.
	t, err := tokenizer.New(ipa.Dict(), tokenizer.OmitBosEos())
	if err != nil {
		panic(err)
	}
	// wakati (simple word splitting/segmentation)
	fmt.Println("---wakati---")
	seg := t.Wakati("すもももももももものうち")
	fmt.Println(seg)
	// tokenize w/ morphological analysis
	fmt.Println("---tokenize---")
	tokens := t.Tokenize("すもももももももものうち")
	for _, token := range tokens {
		features := strings.Join(token.Features(), ",")
		fmt.Printf("%s\t%v\n", token.Surface, features)
	}
}
output:
---wakati---
[すもも も もも も もも の うち]
---tokenize---
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
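The feature string printed for each token follows the MeCab-IPADIC column order, which matches the JSON output of the command line tool: four part-of-speech levels, inflection type, inflection form, base form, reading, and pronunciation. As a minimal sketch using only the standard library, a tab-separated output line could be unpacked as follows. The `Morpheme` type and `parseLine` helper are hypothetical names for illustration and are not part of Kagome's API.

```go
package main

import (
	"fmt"
	"strings"
)

// Morpheme is a hypothetical container for one line of tokenizer output.
type Morpheme struct {
	Surface       string
	POS           []string // part-of-speech hierarchy (four levels in IPADIC)
	BaseForm      string
	Reading       string
	Pronunciation string
}

// parseLine splits "surface<TAB>feature,feature,..." into a Morpheme.
// It assumes the nine-column IPADIC layout; other dictionaries such as
// UniDic use a different layout, and unknown words may carry fewer columns.
func parseLine(line string) (Morpheme, error) {
	surface, rest, ok := strings.Cut(line, "\t")
	if !ok {
		return Morpheme{}, fmt.Errorf("no tab separator in %q", line)
	}
	f := strings.Split(rest, ",")
	if len(f) < 9 {
		return Morpheme{}, fmt.Errorf("want 9 features, got %d", len(f))
	}
	return Morpheme{
		Surface:       surface,
		POS:           f[0:4],
		BaseForm:      f[6],
		Reading:       f[7],
		Pronunciation: f[8],
	}, nil
}

func main() {
	m, err := parseLine("すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ")
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Surface, m.POS[0], m.Reading)
}
```

Inside a Go program the same values are already available as a slice from `token.Features()`, so this kind of parsing is mainly useful when consuming the command line output from another process.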
As a C library
Kagome is written in pure Go but can be compiled as a C shared library and used from other languages via FFI (Foreign Function Interface).
See the "Use from other languages (FFI)" section below for details and examples.
More examples
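We provide various examples demonstrating how to use Kagome in different scenarios; see the ./_examples directory, which includes the WebAssembly (./_examples/wasm) and C library (./_examples/clib/) examples referenced below.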
Install
To get the kagome command line tool, choose your preferred installation method below:
- Go (recommended)
  go install github.com/ikawaha/kagome/v2@latest
- Homebrew
  # macOS and Linux (for both AMD64 and Arm64)
  brew install ikawaha/kagome/kagome
- Manual Install
  - For manual installation, download and extract the appropriate archived file for your OS and architecture from the releases page.
  - Note that the extracted binary must be placed in an accessible directory with execution permission.
- Docker/Docker Compose
  - See the Docker section below
Commands
The major subcommands of the kagome command line tool are described below.
Tokenize command
% # interactive/REPL mode
% kagome
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # piped standard input
% echo "すもももももももものうち" | kagome
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
% # JSON output
% # (For jq command see https://jqlang.org/)
% echo "猫" | kagome -json | jq .
[
  {
    "id": 286994,
    "start": 0,
    "end": 1,
    "surface": "猫",
    "class": "KNOWN",
    "pos": [
      "名詞",
      "一般",
      "*",
      "*"
    ],
    "base_form": "猫",
    "reading": "ネコ",
    "pronunciation": "ネコ",
    "features": [
      "名詞",
      "一般",
      "*",
      "*",
      "*",
      "*",
      "猫",
      "ネコ",
      "ネコ"
    ]
  }
]
% # word splitting/segmentation only (equivalent to "wakati" functionality)
% echo "すもももももももものうち" | kagome -json | jq -r '[.[].surface] | join("/")'
すもも/も/もも/も/もも/の/うち
% # Extract only pronunciations using jq (for Text-to-Speech purposes, etc.)
% echo "私ははにわよわわわんわん" | kagome -json | jq -r '.[].pronunciation'
ワタシ
ワ
ハニワ
ヨ
ワ
ワ
ワンワン
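The jq filters above can also be reproduced in Go when the JSON output is consumed by a program. The following is a sketch under stated assumptions: the `Token` struct simply mirrors the field names visible in the sample output, and `decodeTokens` and `surfaces` are hypothetical helper names, not part of Kagome.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Token mirrors the fields visible in the `kagome -json` sample output.
type Token struct {
	ID            int      `json:"id"`
	Start         int      `json:"start"`
	End           int      `json:"end"`
	Surface       string   `json:"surface"`
	Class         string   `json:"class"`
	POS           []string `json:"pos"`
	BaseForm      string   `json:"base_form"`
	Reading       string   `json:"reading"`
	Pronunciation string   `json:"pronunciation"`
	Features      []string `json:"features"`
}

// decodeTokens parses one JSON array as printed by `kagome -json`.
func decodeTokens(data []byte) ([]Token, error) {
	var tokens []Token
	if err := json.Unmarshal(data, &tokens); err != nil {
		return nil, err
	}
	return tokens, nil
}

// surfaces joins the token surfaces with sep,
// like jq -r '[.[].surface] | join("/")'.
func surfaces(tokens []Token, sep string) string {
	parts := make([]string, 0, len(tokens))
	for _, t := range tokens {
		parts = append(parts, t.Surface)
	}
	return strings.Join(parts, sep)
}

func main() {
	// Sample taken from the JSON output shown above (features omitted).
	sample := []byte(`[{"id": 286994, "start": 0, "end": 1, "surface": "猫",
		"class": "KNOWN", "pos": ["名詞", "一般", "*", "*"],
		"base_form": "猫", "reading": "ネコ", "pronunciation": "ネコ"}]`)
	tokens, err := decodeTokens(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(surfaces(tokens, "/"))
	for _, t := range tokens {
		fmt.Println(t.Pronunciation) // like jq -r '.[].pronunciation'
	}
}
```

When the input spans several lines, the tool's output is a sequence of JSON arrays rather than a single one, as the -split example in the Sentence command section suggests. A `json.Decoder` loop would then be a better fit than a single `json.Unmarshal` call.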
Server command
For continuous usage, kagome provides a server mode, so that the startup cost of the tokenizer is paid once rather than on every invocation.
RESTful API
Start a server and try to access the "/tokenize" endpoint.
% kagome server &
% curl -XPUT localhost:6060/tokenize -d'{"sentence":"すもももももももものうち", "mode":"normal"}' | jq .
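The same request can be issued from Go with the standard library. This is a minimal sketch: the request body uses only the two fields shown in the curl example, and the response is returned as raw bytes because its schema is not described here. The helper names `putTokenize` and `newEchoServer` are hypothetical, and the echo server is only a local stand-in that lets the sketch run without a kagome server.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// tokenizeRequest is the JSON body from the curl example.
type tokenizeRequest struct {
	Sentence string `json:"sentence"`
	Mode     string `json:"mode"`
}

// putTokenize sends a PUT to baseURL+"/tokenize" and returns the raw body.
func putTokenize(baseURL, sentence, mode string) ([]byte, error) {
	payload, err := json.Marshal(tokenizeRequest{Sentence: sentence, Mode: mode})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPut, baseURL+"/tokenize", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// newEchoServer starts a local stand-in that echoes the request body.
// It is NOT the kagome server and performs no tokenization.
func newEchoServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPut || r.URL.Path != "/tokenize" {
			http.NotFound(w, r)
			return
		}
		_, _ = io.Copy(w, r.Body)
	}))
}

func main() {
	// For a real server started with `kagome server`, pass
	// "http://localhost:6060" instead of srv.URL.
	srv := newEchoServer()
	defer srv.Close()

	body, err := putTokenize(srv.URL, "すもももももももものうち", "normal")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```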
Web App
Start a server and access http://localhost:6060 in your browser.
% kagome server &
[!IMPORTANT] The demo web application uses graphviz to draw a lattice. You need graphviz to be installed on your system.
[!TIP] Kagome can be compiled to WebAssembly (wasm) and run locally in a web browser as well. For details, see the WebAssembly section.
- Wasm Demo: https://ikawaha.github.io/kagome/
Lattice command
A debugging tool for the tokenization process; it outputs the lattice in Graphviz dot format.
% kagome lattice 私は鰻 | dot -Tpng -o lattice.png

Sentence command
Split long text into sentences:
% echo "吾輩は猫である。名前はまだ無い。" | kagome sentence
吾輩は猫である。
名前はまだ無い。
This command is useful if a single line of data is too lengthy, and you want to avoid errors such as bufio.Scanner: token too long.
% echo "吾輩は猫である。名前はまだ無い。" | kagome -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。/名前/は/まだ/無い/。
% echo "吾輩は猫である。名前はまだ無い。" | kagome sentence | kagome -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。
名前/は/まだ/無い/。
This command is equivalent to the -split option of the tokenize command.
% echo "吾輩は猫である。名前はまだ無い。" | kagome -split -json | jq -r '[.[].surface] | join("/")'
吾輩/は/猫/で/ある/。
名前/は/まだ/無い/。
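To illustrate what sentence splitting does to the input before tokenization, here is a small standard-library sketch. It is not kagome's implementation: the delimiter set and the `splitSentences` name are assumptions made for this example, and the real splitter may follow different rules.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences cuts text after each sentence-ending character and
// returns the non-empty pieces. The delimiter set is an assumption for
// illustration only.
func splitSentences(text string) []string {
	const delims = "。！？!?"
	var out []string
	var b strings.Builder
	flush := func() {
		if s := strings.TrimSpace(b.String()); s != "" {
			out = append(out, s)
		}
		b.Reset()
	}
	for _, r := range text {
		b.WriteRune(r)
		if strings.ContainsRune(delims, r) {
			flush()
		}
	}
	flush() // keep trailing text that has no closing delimiter
	return out
}

func main() {
	for _, s := range splitSentences("吾輩は猫である。名前はまだ無い。") {
		fmt.Println(s)
	}
}
```

Feeding a tokenizer one sentence at a time keeps each unit of input short, which is the point of the `bufio.Scanner: token too long` remark above.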
Dictionaries
- Currently supported dictionaries by default.

| dict | source | package |
|---|---|---|
| MeCab IPADIC | mecab-ipadic-2.7.0-20070801 | github.com/ikawaha/kagome-dict/ipa |
| UniDIC | unidic-mecab-2.1.2_src | github.com/ikawaha/kagome-dict/uni |

- Experimental Features

| dict | source | package |
|---|---|---|
| mecab-ipadic-NEologd | mecab-ipadic-neologd | github.com/ikawaha/kagome-ipa-neologd |
| Korean MeCab | mecab-ko-dic-2.1.1-20180720 | github.com/ikawaha/kagome-dict-ko |
[!NOTE] For more details and differences between the dictionaries, see the wiki.
Segmentation modes
Similar to Kuromoji, Kagome also supports various segmentation modes (splitting strategies) to tokenize the input text.
- Normal: Regular segmentation
- Search: Use a heuristic to perform additional segmentation that is useful for search purposes
- Extended: Similar to search mode, but also splits unknown words into uni-grams
| Untokenized | Normal | Search | Extended |
|---|---|---|---|
| 関西国際空港 | 関西国際空港 | 関西　国際　空港 | 関西　国際　空港 |
| 日本経済新聞 | 日本経済新聞 | 日本　経済　新聞 | 日本　経済　新聞 |
| シニアソフトウェアエンジニア | シニアソフトウェアエンジニア | シニア　ソフトウェア　エンジニア | シニア　ソフトウェア　エンジニア |
| デジカメを買った | デジカメ　を　買っ　た | デジカメ　を　買っ　た | デ　ジ　カ　メ　を　買っ　た |
[!NOTE] If your purpose is for search, try changing the mode before switching to another dictionary.
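The last table row shows what uni-gram splitting means in Extended mode: デジカメ is broken into single characters. A tiny sketch of that one step, where the `unigrams` helper is a hypothetical name and this is not kagome's code:

```go
package main

import "fmt"

// unigrams splits a word into one-character strings. It iterates over
// runes, not bytes, so multi-byte characters stay intact.
func unigrams(word string) []string {
	runes := []rune(word)
	out := make([]string, 0, len(runes))
	for _, r := range runes {
		out = append(out, string(r))
	}
	return out
}

func main() {
	fmt.Println(unigrams("デジカメ")) // prints [デ ジ カ メ]
}
```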
Docker
We provide scratch-based Docker images that simply run the kagome command line tool on various architectures: AMD64, Arm64, Arm32 (Arm v5, v6 and v7)
- Pull the image
  docker pull ikawaha/kagome:latest
  # Alternatively, you can pull from GitHub Container Registry
  docker pull ghcr.io/ikawaha/kagome:latest
- Run the command via Docker
  # Interactive/REPL mode
  docker run --rm -it ikawaha/kagome:latest
  # If pulling from GitHub Container Registry
  docker run --rm -it ghcr.io/ikawaha/kagome:latest
- Run the server via Docker
  # Server mode (http://localhost:6060)
  docker run --rm -p 6060:6060 ikawaha/kagome:latest server
  # If pulling from GitHub Container Registry
  docker run --rm -p 6060:6060 ghcr.io/ikawaha/kagome:latest server
- docker-compose.yml example
  services:
    kagome:
      image: ikawaha/kagome:latest
      ports: ["6060:6060"]
      command: server
      restart: unless-stopped
Note: Base image doesn't include Graphviz. For lattice visualization, see examples.
WebAssembly
Kagome compiles to WebAssembly for browser use.
- Live demo: https://ikawaha.github.io/kagome/
- Source code: ./_examples/wasm
Use from other languages (FFI)
Kagome is written in pure Go but can be compiled as a C shared library and used from other languages via FFI (Foreign Function Interface).
- Currently supported/tested languages:
  - Python 3.12+ (using ctypes)
  - PHP 8+ (using FFI)
# Python example using ctypes
from libkagome import Kagome

kagome = Kagome()
tokens = kagome.tokenize("すもももももももものうち")
for token in tokens:
    print(f"{token.surface}\t{token.pos}")
<!-- PHP example using FFI -->
<?php
declare(strict_types=1);
require __DIR__ . '/libkagome.php';
$kagome = new Kagome();
$tokens = $kagome->tokenize("すもももももももものうち");
foreach ($tokens as $token) {
    echo "{$token->surface}\t" . implode(',', $token->pos) . "\n";
}
For complete examples and build instructions, see:
- ./_examples/clib/ - C library FFI examples for Python and PHP
[!NOTE] The C library provides thread-safe tokenization with proper memory management and includes comprehensive tests.
Reference
- Detailed Reference Manual in Japanese:
- Community Wiki in English:
License
- MIT
