
Architecture Overview

Tokage consists of nine independently deployable Go microservices connected through Redpanda, with ClickHouse for analytics, PostgreSQL for control-plane state, and Redis for fast auth caching.

Event Flow

  1. Ingest: SDK sends TokenUsageEvent to POST /v1/events
  2. Authenticate: Auth Service resolves API key to tenant_id via Redis cache (<5ms)
  3. Validate: Schema validation, rate limit check (1,000 events/s steady-state)
  4. Produce: Event written to Redpanda with acks=all (all in-sync replicas)
  5. ACK: HTTP 202 returned to client only after Redpanda ACK, never before
  6. Process: Event Processor consumes, deduplicates on event_id (Redis-backed)
  7. Cost-stamp: Price Registry lookup stamps cost_usd; client-supplied cost is ignored
  8. Classify: Data classification: BILLING, PII, CONF, OPS, or PUB
  9. Store: Written to ClickHouse (hot) + S3 Parquet (cold archive)
  10. Query: Materialized views power dashboards; raw table serves ad-hoc queries

Key Guarantees

Zero data loss post-ACK

No HTTP 202 before Redpanda acks=all (RF=3). Architecturally enforced, not convention.

Billing accuracy

Cost stamped server-side by Price Registry. <0.5% variance vs. provider invoice.

Structural tenant isolation

Enforced at database layer (PostgreSQL RLS, ClickHouse predicates) — not application code.

Immutable audit

S3 Object Lock COMPLIANCE mode. No delete path exists in any service.

Services

Ingestion Service

services/ingestion/

Front door for all token usage data. Accepts POST /v1/events and POST /v1/events/batch. HTTP 202 is returned only after Redpanda acks=all.

Auth Service

services/auth/

Resolves tok_live_* API keys to tenant_id. Redis L1 cache with in-memory LRU and circuit breaker. Fail-closed: unauthenticated events never enter the pipeline.

Event Processor

services/event-processor/

Consumes from Redpanda, deduplicates on event_id, enriches with timestamp_ingest, cost-stamps via Price Registry, classifies data, routes to ClickHouse + S3 + quarantine.
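
The deduplication step can be sketched as a "first seen wins" check keyed on event_id. The Event, Deduper, and Process names are hypothetical, and the in-memory set below only stands in for the Redis-backed store; against Redis, an atomic set-if-absent with an expiry (for example SET with the NX and EX options) would play the same role.

```go
package main

import "fmt"

// Event carries only the fields this sketch needs; the real
// TokenUsageEvent schema is not shown in the source.
type Event struct {
	EventID  string
	TenantID string
}

// Deduper is an in-memory stand-in for a Redis-backed seen-set.
type Deduper struct {
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstSeen reports true exactly once per event ID.
func (d *Deduper) FirstSeen(eventID string) bool {
	if _, ok := d.seen[eventID]; ok {
		return false
	}
	d.seen[eventID] = struct{}{}
	return true
}

// Process returns the events that survive deduplication, in input order.
func Process(d *Deduper, batch []Event) []Event {
	out := make([]Event, 0, len(batch))
	for _, ev := range batch {
		if d.FirstSeen(ev.EventID) {
			out = append(out, ev)
		}
	}
	return out
}

func main() {
	d := NewDeduper()
	batch := []Event{
		{EventID: "e1", TenantID: "t1"},
		{EventID: "e2", TenantID: "t1"},
		{EventID: "e1", TenantID: "t1"}, // redelivered duplicate
	}
	fmt.Println(len(Process(d, batch)))
}
```

Deduplicating in the consumer means producers and the broker can retry freely: a redelivered event is dropped before it is cost-stamped or stored.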

Query API

services/query-api/

Analytics and dashboards. Cost breakdowns by model, application, user, and team. Tenant-isolated: tenant_id is the first WHERE predicate on every query.
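
One way to make "tenant_id is the first WHERE predicate" hard to forget is to route every query through a single builder. The sketch below is hypothetical: BuildCostQuery, the token_usage_events table, and the column names are assumptions, and it complements rather than replaces the database-layer enforcement described under Key Guarantees.

```go
package main

import (
	"errors"
	"fmt"
)

// allowedGroupBy whitelists the breakdown dimensions. The column
// names are assumptions made for this sketch.
var allowedGroupBy = map[string]bool{
	"model":       true,
	"application": true,
	"user_id":     true,
	"team_id":     true,
}

// BuildCostQuery returns a parameterized query whose first WHERE
// predicate is always tenant_id, plus the arguments to bind.
func BuildCostQuery(tenantID, groupBy string) (string, []any, error) {
	if tenantID == "" {
		return "", nil, errors.New("tenant_id is required")
	}
	if !allowedGroupBy[groupBy] {
		// Only whitelisted identifiers are ever concatenated into SQL.
		return "", nil, fmt.Errorf("unsupported group-by column %q", groupBy)
	}
	query := "SELECT " + groupBy + ", sum(cost_usd) AS total_cost_usd" +
		" FROM token_usage_events" +
		" WHERE tenant_id = ?" +
		" GROUP BY " + groupBy
	return query, []any{tenantID}, nil
}

func main() {
	query, args, err := BuildCostQuery("tenant_1", "model")
	fmt.Println(query)
	fmt.Println(args, err)
}
```

The tenant ID travels as a bound parameter and the group-by column comes from a fixed whitelist, so no caller-supplied string is interpolated into the statement.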

Price Registry

services/price-registry/

Authoritative source of per-model token prices. All costs are stamped server-side. Client-supplied cost values are rejected.

Control Plane

services/control-plane/

Tenant management, API key CRUD, user/team management. Backed by PostgreSQL with Row-Level Security.

Billing Pipeline

services/billing-pipeline/

Daily aggregation job. Computes billing-accurate per-tenant totals and exports to Stripe/Orb.

Audit Log

services/audit-log/

Writes immutable WORM records to S3 Object Lock (COMPLIANCE mode). No delete path exists. For SOC 2 Type II compliance.

Synthetic Monitor

services/synthetic-monitor/

End-to-end health checks that exercise the full ingestion and query path. Emits Prometheus metrics.

Technology Stack

| Layer | Technology | Rationale |
| --- | --- | --- |
| Language | Go 1.22 | Performance, simplicity, excellent concurrency primitives |
| HTTP | stdlib net/http | No framework overhead; production-proven |
| Message Broker | Redpanda v24.1 | Kafka-compatible, no ZooKeeper, ~3x lower p99 vs Kafka |
| OLAP | ClickHouse 24.3 | Best-in-class columnar compression, fast GROUP BY for analytics |
| Operational DB | PostgreSQL 16 | Row-Level Security for structural tenant isolation |
| Auth Cache | Redis 7 | Sub-5ms key resolution with circuit breaker (sony/gobreaker) |
| Archive | S3 Parquet | Cost-efficient cold storage, IAM-scoped access |
| Audit | S3 Object Lock | WORM compliance, immutable records, COMPLIANCE mode |
| Kafka Client | twmb/franz-go | Pure Go, excellent Redpanda compatibility |
| Logging | go.uber.org/zap | Structured, zero-allocation production logging |
| Dashboard | Next.js 14 + Tailwind | App Router, React Server Components, type-safe |