Architecture Overview
Tokage consists of nine independently deployable Go microservices connected through Redpanda, with ClickHouse for analytics, PostgreSQL for control-plane state, and Redis for fast auth caching.
Event Flow
1. Ingest: SDK sends TokenUsageEvent to POST /v1/events
2. Authenticate: Auth Service resolves API key to tenant_id via Redis cache (<5ms)
3. Validate: Schema validation, rate limit check (1,000 events/s steady-state)
4. Produce: Event written to Redpanda with acks=all (all in-sync replicas)
5. ACK: HTTP 202 returned to client only after the Redpanda ACK, never before
6. Process: Event Processor consumes, deduplicates on event_id (Redis-backed)
7. Cost-stamp: Price Registry lookup stamps cost_usd; client-supplied cost is ignored
8. Classify: Data classification: BILLING, PII, CONF, OPS, or PUB
9. Store: Written to ClickHouse (hot) + S3 Parquet (cold archive)
10. Query: Materialized views power dashboards; raw table serves ad-hoc queries
Key Guarantees
Zero data loss post-ACK
No HTTP 202 before Redpanda acks=all (RF=3). Enforced by the architecture, not by convention.
Billing accuracy
Cost stamped server-side by Price Registry. <0.5% variance vs. provider invoice.
Structural tenant isolation
Enforced at the database layer (PostgreSQL RLS, ClickHouse predicates), not in application code.
Immutable audit
S3 Object Lock COMPLIANCE mode. No delete path exists in any service.
Services
Ingestion Service
services/ingestion/
Front door for all token usage data. Accepts POST /v1/events and POST /v1/events/batch. HTTP 202 is returned only after Redpanda acks=all.
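The "202 only after the broker ACK" rule can be sketched as a plain net/http handler. This is a minimal illustration, not the service's actual code: the `Event` fields, the `Producer` interface, `fakeProducer`, and the 400/503 status choices are all assumptions made for the example. The only behavior taken from the text above is that HTTP 202 is written strictly after the produce call has returned successfully.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Event is a hypothetical, trimmed-down stand-in for TokenUsageEvent.
type Event struct {
	EventID string `json:"event_id"`
	Model   string `json:"model"`
}

// Producer is a hypothetical abstraction over the broker client. Produce is
// expected to block until the broker has acknowledged the write.
type Producer interface {
	Produce(ctx context.Context, value []byte) error
}

var errBrokerDown = errors.New("broker unavailable")

// fakeProducer stands in for Redpanda and returns a fixed result.
type fakeProducer struct{ err error }

func (f fakeProducer) Produce(context.Context, []byte) error { return f.err }

// ingestHandler writes 202 only after Produce has returned nil.
func ingestHandler(p Producer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var ev Event
		if err := json.NewDecoder(r.Body).Decode(&ev); err != nil || ev.EventID == "" {
			http.Error(w, "invalid event", http.StatusBadRequest)
			return
		}
		payload, err := json.Marshal(ev)
		if err != nil {
			http.Error(w, "encoding failed", http.StatusInternalServerError)
			return
		}
		// No response is written until the broker call has finished.
		if err := p.Produce(r.Context(), payload); err != nil {
			http.Error(w, "event not accepted", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// post sends one JSON body through the handler and returns the status code.
func post(p Producer, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/v1/events", strings.NewReader(body))
	rec := httptest.NewRecorder()
	ingestHandler(p).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	ok := `{"event_id":"evt-1","model":"example-model"}`
	fmt.Println(post(fakeProducer{}, ok))                   // broker acknowledged
	fmt.Println(post(fakeProducer{err: errBrokerDown}, ok)) // broker failed
	fmt.Println(post(fakeProducer{}, `{}`))                 // missing event_id
}
```

Because the success path has a single exit that sits after the produce call, there is no branch in which a client could receive 202 for an event the broker never acknowledged.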
Auth Service
services/auth/
Resolves tok_live_* API keys to tenant_id. Redis L1 cache with in-memory LRU and circuit breaker. Fail-closed: unauthenticated events never enter the pipeline.
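Fail-closed resolution can be illustrated with a small resolver that treats every lookup failure as "unauthenticated". This is a sketch under assumptions: `KeyStore`, `mapStore`, `Resolver`, and the error values are hypothetical names, the in-process map has no LRU eviction, and the circuit breaker is left out entirely.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

var ErrUnauthenticated = errors.New("unauthenticated")
var errStoreDown = errors.New("store unavailable")

// KeyStore is a hypothetical lookup interface (Redis in the real system).
type KeyStore interface {
	TenantFor(apiKey string) (tenantID string, found bool, err error)
}

// mapStore is an in-memory stand-in that can simulate an outage via err.
type mapStore struct {
	keys map[string]string
	err  error
}

func (s *mapStore) TenantFor(apiKey string) (string, bool, error) {
	if s.err != nil {
		return "", false, s.err
	}
	t, ok := s.keys[apiKey]
	return t, ok, nil
}

// Resolver checks an in-process map first, then falls back to the store.
type Resolver struct {
	mu    sync.Mutex
	local map[string]string
	store KeyStore
}

func NewResolver(store KeyStore) *Resolver {
	return &Resolver{local: make(map[string]string), store: store}
}

// Resolve returns the tenant for a key, or ErrUnauthenticated on any doubt.
func (r *Resolver) Resolve(apiKey string) (string, error) {
	if !strings.HasPrefix(apiKey, "tok_live_") {
		return "", ErrUnauthenticated
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if t, ok := r.local[apiKey]; ok {
		return t, nil
	}
	t, found, err := r.store.TenantFor(apiKey)
	if err != nil || !found {
		// Fail closed: a store error is treated exactly like an unknown key.
		return "", ErrUnauthenticated
	}
	r.local[apiKey] = t
	return t, nil
}

func main() {
	store := &mapStore{keys: map[string]string{"tok_live_abc": "tenant-1"}}
	r := NewResolver(store)
	fmt.Println(r.Resolve("tok_live_abc"))
	fmt.Println(r.Resolve("tok_live_unknown"))
	store.err = errStoreDown
	fmt.Println(r.Resolve("tok_live_other"))
}
```

In this sketch, keys that were already cached locally keep resolving while the store is down. Whether that is acceptable depends on how quickly revoked keys must stop working, which is a policy decision not covered here.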
Event Processor
services/event-processor/
Consumes from Redpanda, deduplicates on event_id, enriches with timestamp_ingest, cost-stamps via Price Registry, classifies data, and routes to ClickHouse, S3, and quarantine.
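Deduplication on event_id comes down to a "first writer wins" check. The sketch below uses an in-memory set as a stand-in for the Redis-backed store, so `Deduper`, `FirstSeen`, and `Unique` are hypothetical names. A Redis version would need an equivalent atomic set-if-absent operation, and probably an expiry, neither of which is specified in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// Deduper remembers event IDs it has already seen.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]struct{})}
}

// FirstSeen records the ID and reports whether this is its first occurrence.
// The check and the insert happen under one lock, so concurrent callers
// cannot both be told "first".
func (d *Deduper) FirstSeen(eventID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, dup := d.seen[eventID]; dup {
		return false
	}
	d.seen[eventID] = struct{}{}
	return true
}

// Unique returns the IDs that were not seen before, in input order.
func (d *Deduper) Unique(eventIDs []string) []string {
	var out []string
	for _, id := range eventIDs {
		if d.FirstSeen(id) {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	d := NewDeduper()
	fmt.Println(d.Unique([]string{"evt-1", "evt-2", "evt-1"})) // [evt-1 evt-2]
	fmt.Println(d.FirstSeen("evt-2"))                          // false
}
```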
Query API
services/query-api/
Analytics and dashboards. Cost breakdowns by model, application, user, and team. Tenant-isolated: tenant_id is the first WHERE predicate on every query.
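One way to keep tenant_id as the first WHERE predicate is to route every query through a single builder that cannot emit SQL without it. The following is a sketch only: the table name `token_usage_events`, the column names, and `costBreakdownQuery` are assumptions for illustration, and this application-side check would sit on top of the database-layer enforcement, not replace it.

```go
package main

import (
	"errors"
	"fmt"
)

// dimensionColumns is an allow-list mapping API dimensions to hypothetical
// column names, so user input is never spliced into the SQL text.
var dimensionColumns = map[string]string{
	"model":       "model",
	"application": "application",
	"user":        "user_id",
	"team":        "team_id",
}

// costBreakdownQuery builds a parameterized query whose first predicate is
// always tenant_id. It refuses to build a query without a tenant.
func costBreakdownQuery(tenantID, dimension string) (string, []any, error) {
	if tenantID == "" {
		return "", nil, errors.New("tenant_id is required")
	}
	col, ok := dimensionColumns[dimension]
	if !ok {
		return "", nil, fmt.Errorf("unsupported dimension %q", dimension)
	}
	q := "SELECT " + col + " AS dimension, sum(cost_usd) AS total_cost_usd" +
		" FROM token_usage_events" +
		" WHERE tenant_id = ?" +
		" GROUP BY " + col +
		" ORDER BY total_cost_usd DESC"
	return q, []any{tenantID}, nil
}

func main() {
	q, args, err := costBreakdownQuery("tenant-1", "model")
	fmt.Println(q)
	fmt.Println(args, err)

	_, _, err = costBreakdownQuery("", "model")
	fmt.Println(err)
}
```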
Price Registry
services/price-registry/
Authoritative source of per-model token prices. All costs are stamped server-side. Client-supplied cost values are rejected.
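Server-side cost stamping can be sketched as a lookup followed by a multiplication. Everything concrete here is an assumption made for the example: the `UsageEvent` fields, per-million-token pricing, the `PriceRegistry` map type, and the prices themselves. In this sketch a client-supplied cost is simply overwritten; a service that rejects such events outright would return an error at that point instead.

```go
package main

import (
	"fmt"
)

// Price holds hypothetical per-million-token prices in USD.
type Price struct {
	InputPerMillionUSD  float64
	OutputPerMillionUSD float64
}

// UsageEvent is a hypothetical subset of the event fields needed for pricing.
type UsageEvent struct {
	Model        string
	InputTokens  int64
	OutputTokens int64
	CostUSD      float64 // any incoming value is not trusted
}

// PriceRegistry maps a model name to its price.
type PriceRegistry map[string]Price

// Stamp returns a copy of the event with CostUSD computed from the registry.
// It never reads the incoming CostUSD value.
func (pr PriceRegistry) Stamp(ev UsageEvent) (UsageEvent, error) {
	p, ok := pr[ev.Model]
	if !ok {
		return ev, fmt.Errorf("no price for model %q", ev.Model)
	}
	ev.CostUSD = float64(ev.InputTokens)*p.InputPerMillionUSD/1e6 +
		float64(ev.OutputTokens)*p.OutputPerMillionUSD/1e6
	return ev, nil
}

func main() {
	registry := PriceRegistry{
		"example-model": {InputPerMillionUSD: 2.0, OutputPerMillionUSD: 8.0},
	}
	ev := UsageEvent{
		Model:        "example-model",
		InputTokens:  500_000,
		OutputTokens: 250_000,
		CostUSD:      99, // client-supplied, overwritten below
	}
	stamped, err := registry.Stamp(ev)
	fmt.Println(stamped.CostUSD, err) // 3 <nil>
}
```

float64 keeps the example short. A billing-grade implementation would more likely use integer micro-units or a decimal type to avoid rounding drift.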
Control Plane
services/control-plane/
Tenant management, API key CRUD, user/team management. Backed by PostgreSQL with Row-Level Security.
Billing Pipeline
services/billing-pipeline/
Daily aggregation job. Computes billing-accurate per-tenant totals and exports to Stripe/Orb.
Audit Log
services/audit-log/
Writes immutable WORM records to S3 Object Lock (COMPLIANCE mode). No delete path exists. Supports SOC 2 Type II compliance.
Synthetic Monitor
services/synthetic-monitor/
End-to-end health checks that exercise the full ingestion and query path. Emits Prometheus metrics.
Technology Stack
| Layer | Technology | Rationale |
|---|---|---|
| Language | Go 1.22 | Performance, simplicity, excellent concurrency primitives |
| HTTP | stdlib net/http | No framework overhead; production-proven |
| Message Broker | Redpanda v24.1 | Kafka-compatible, no ZooKeeper, ~3x lower p99 vs Kafka |
| OLAP | ClickHouse 24.3 | Best-in-class columnar compression, fast GROUP BY for analytics |
| Operational DB | PostgreSQL 16 | Row-Level Security for structural tenant isolation |
| Auth Cache | Redis 7 | Sub-5ms key resolution with circuit breaker (sony/gobreaker) |
| Archive | S3 Parquet | Cost-efficient cold storage, IAM-scoped access |
| Audit | S3 Object Lock | WORM compliance, immutable records, COMPLIANCE mode |
| Kafka Client | twmb/franz-go | Pure Go, excellent Redpanda compatibility |
| Logging | go.uber.org/zap | Structured, zero-allocation production logging |
| Dashboard | Next.js 14 + Tailwind | App Router, React Server Components, type-safe |