NIMBUS
Observability that respects your time.
A high-cardinality metrics, logs, and traces platform built for engineering teams that find Datadog too expensive and Grafana too DIY. Sane defaults, exceptional ergonomics.
From brief to production system.
Mid-size teams are stuck: Datadog scales costs faster than usage, while the Grafana stack requires a dedicated platform engineer. Logs and metrics live in silos, and incident timelines are stitched together manually in Slack.
ClickHouse-backed columnar storage with Kafka ingestion. Trace-correlated logs by default. AI-assisted incident timelines that auto-stitch deploys, alerts, and Slack chatter into a single audit trail. PromQL + LogQL compatible query layer.
MTTR dropped 68% across 340 customer teams. Storage cost reduced 54% vs equivalent Datadog usage. Used by 3 YC-backed startups + 1 unicorn DevOps team. Now processing 1.2M events/sec at peak.
How it shipped, week by week.
Architecture + ClickHouse PoC
Benchmarked ClickHouse against TimescaleDB and InfluxDB. ClickHouse won on storage compression (4.2x) and on query speed at our cardinality.
Ingestion + storage
Spring Boot ingestion gateway. Kafka buffer for back-pressure tolerance. Schema design optimized for compression, saving 54% vs a naive layout.
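The back-pressure story above lives mostly in producer configuration. A minimal sketch of the kind of settings involved (the property names are standard Kafka producer configs, but these specific values are illustrative assumptions, not the ones Nimbus ships):

```java
import java.util.Properties;

public class GatewayProducerConfig {
    // Illustrative values only; tune buffer.memory and max.block.ms
    // to your own ingest burst profile.
    public static Properties props(String brokers) {
        Properties p = new Properties();
        p.put("bootstrap.servers", brokers);
        p.put("acks", "all");              // don't ack the client until Kafka has the event
        p.put("buffer.memory", String.valueOf(64L * 1024 * 1024)); // 64 MiB buffer absorbs bursts
        p.put("max.block.ms", "500");      // fail fast instead of hanging when the buffer is full
        p.put("compression.type", "zstd"); // cheap bandwidth win on event payloads
        p.put("linger.ms", "20");          // small batching window for throughput
        return p;
    }

    public static void main(String[] args) {
        Properties p = props("localhost:9092");
        System.out.println(p.getProperty("acks"));         // all
        System.out.println(p.getProperty("max.block.ms")); // 500
    }
}
```

The key trade-off is `max.block.ms`: a low value surfaces overload to the caller quickly, letting the gateway shed load rather than queue unboundedly.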
Query layer + dashboards
Custom planner translating PromQL → ClickHouse SQL. React dashboards with 22 widget types; chose ECharts over D3 for rendering performance at this scale.
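To make the translation step concrete, here is a toy version for a single instant-vector selector with equality matchers. The real planner handles range vectors, functions, and binary operators, and would escape or parametrize inputs; the class and method names here are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PromToSql {
    // Translate a selector like http_requests_total{job="api"} into a
    // ClickHouse query against a metrics table with a labels Map column.
    // No escaping, for brevity; a real planner must not splice strings.
    public static String translate(String metric, Map<String, String> matchers) {
        String labelFilter = matchers.entrySet().stream()
                .map(e -> "labels['" + e.getKey() + "'] = '" + e.getValue() + "'")
                .collect(Collectors.joining(" AND "));
        return "SELECT timestamp, value FROM metrics"
                + " WHERE metric_name = '" + metric + "'"
                + (labelFilter.isEmpty() ? "" : " AND " + labelFilter)
                + " ORDER BY timestamp";
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("job", "api");
        System.out.println(translate("http_requests_total", m));
    }
}
```

Filtering on `metric_name` first matches the table's `ORDER BY (tenant_id, metric_name, timestamp)` key, so ClickHouse can skip granules rather than scan them.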
AI incident timelines
Stitched alerts + deploys + Slack chatter into auto-generated post-mortems. Used Claude to summarize. Saved on-call engineers an average of 40 minutes per incident.
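The stitching step itself is a chronological merge of the three feeds. A minimal sketch, assuming a simplified event shape (the record and method names are hypothetical, and the LLM summarization step is out of scope here):

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TimelineStitcher {
    // Hypothetical minimal event: source is "alert", "deploy", or "slack".
    record TimelineEvent(Instant at, String source, String summary) {}

    // Merge the three feeds into one chronological audit trail; the
    // production pipeline then hands this list to an LLM for the summary.
    static List<TimelineEvent> stitch(List<TimelineEvent> alerts,
                                      List<TimelineEvent> deploys,
                                      List<TimelineEvent> slack) {
        List<TimelineEvent> all = new ArrayList<>();
        all.addAll(alerts);
        all.addAll(deploys);
        all.addAll(slack);
        all.sort(Comparator.comparing(TimelineEvent::at));
        return all;
    }

    public static void main(String[] args) {
        var t0 = Instant.parse("2024-05-01T10:00:00Z");
        var merged = stitch(
            List.of(new TimelineEvent(t0.plusSeconds(120), "alert", "p99 latency breach")),
            List.of(new TimelineEvent(t0, "deploy", "api rollout")),
            List.of(new TimelineEvent(t0.plusSeconds(300), "slack", "rolling back")));
        merged.forEach(e -> System.out.println(e.at() + " [" + e.source() + "] " + e.summary()));
    }
}
```

Sorting by a single wall-clock timestamp is the simplifying assumption here; correlating clock-skewed sources is where the real work is.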
Production rollout
Migrated 340 customer teams over a phased rollout. Zero data loss. Cut over from the legacy stack in three days.
What it does. How it's built.
Features
- Unified metrics + logs + traces
- AI-assisted incident timeline reconstruction
- Cost-aware retention policies (hot → warm → cold)
- Slack-native alerting + chatops
- OpenTelemetry-first ingestion
- PromQL + LogQL compatible query layer
- Custom dashboarding with 22 widget types
- On-call runbook embedded in alerts
Architecture
- Spring Boot ingestion gateway
- Kafka for buffered event streams (3 brokers, 18 partitions)
- ClickHouse cluster (3-node) for storage
- React + ECharts dashboards
- Custom query planner translating PromQL → SQL
- Deployed on AWS ECS Fargate
- S3 for cold storage + Athena for archival queries
- Slack bot built on the Bolt SDK
Annotated excerpts.
@Component
public class EventRouter {

    private final KafkaTemplate<String, Event> kafka;
    private final SamplingPolicy sampler;

    public EventRouter(KafkaTemplate<String, Event> kafka, SamplingPolicy sampler) {
        this.kafka = kafka;
        this.sampler = sampler;
    }

    public CompletableFuture<Ack> route(Event event) {
        // Drop early: sampled-out events never touch Kafka.
        if (!sampler.accept(event)) {
            return CompletableFuture.completedFuture(Ack.dropped(event.id()));
        }
        // One raw topic per signal type keeps consumer groups independent.
        var topic = switch (event.kind()) {
            case METRIC -> "metrics.raw";
            case LOG -> "logs.raw";
            case TRACE -> "traces.raw";
        };
        // Key by tenant so a tenant's events stay ordered within a partition.
        return kafka.send(topic, event.tenantId(), event)
                .thenApply(r -> Ack.accepted(event.id(), r.getRecordMetadata().offset()))
                .exceptionally(ex -> Ack.failed(event.id(), ex));
    }
}

CREATE TABLE metrics (
tenant_id LowCardinality(String),
metric_name LowCardinality(String),
timestamp DateTime64(3),
value Float64,
labels Map(LowCardinality(String), String)
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (tenant_id, metric_name, timestamp)
TTL timestamp + INTERVAL 30 DAY TO VOLUME 'cold',
timestamp + INTERVAL 90 DAY DELETE
SETTINGS storage_policy = 'tiered';

"We were burning $14k/month on a managed observability stack. Ali designed and shipped a self-hosted alternative that's now cheaper, faster, and easier to query. Paid for itself in 60 days."
Have a project like this in mind? Let's talk.
Send me a brief and I'll respond within 24 hours.