Observability
Every Elido service emits Prometheus metrics on /metrics,
OpenTelemetry traces over OTLP, and structured JSON logs to stdout.
Nothing is locked behind a vendor — point your existing stack at
the standard endpoints and it works.
1. Metrics — Prometheus
Each service exposes a /metrics endpoint on the management port
(9090 for the Go services, 9464 for the Node ones). Scrape it
with your existing Prometheus setup:
# prometheus.yml
scrape_configs:
  - job_name: elido-edge-redirect
    metrics_path: /metrics
    static_configs:
      - targets:
          - "edge-fra.elido.internal:9090"
          - "edge-ash.elido.internal:9090"
          - "edge-sgp.elido.internal:9090"
  - job_name: elido-api-core
    static_configs:
      - targets: ["api-core.elido.internal:9090"]

The metrics that matter most:
| Metric | Type | Notes |
|---|---|---|
| elido_redirect_requests_total{status,cache} | counter | by HTTP status + cache hit / miss |
| elido_redirect_latency_seconds | histogram | hot-path latency; derive p50 / p95 / p99 from the buckets |
| elido_link_cache_size | gauge | L1 LRU cache occupancy on the edge |
| elido_click_publish_failed_total | counter | Redpanda publish failures (cold-path) |
| elido_db_query_seconds{query} | histogram | per-query Postgres latency |
| elido_clickhouse_insert_seconds | histogram | ingester batch latency |
| elido_billing_invoice_total{state} | counter | invoices by paid / past_due / void |
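Histograms expose _bucket, _sum, and _count series. PromQL’s
histogram_quantile over the rate of the _bucket series gives you p95
directly in Grafana, for example
histogram_quantile(0.95, sum by (le) (rate(elido_redirect_latency_seconds_bucket[5m]))).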
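To make the quantile calculation concrete, here is a minimal Python sketch of the linear interpolation that histogram_quantile applies to cumulative buckets. It is a simplification for illustration, not the Prometheus implementation, and the bucket bounds and counts in SAMPLE_BUCKETS are made-up values rather than real Elido data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: the last bucket is expected to be the +Inf bucket,
    and observations are assumed to be spread evenly inside each bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            if count == prev_count:
                # Empty bucket at rank 0: nothing to interpolate.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts (upper bound in seconds, count).
SAMPLE_BUCKETS = [
    (0.005, 800),
    (0.010, 950),
    (0.025, 990),
    (0.050, 1000),
    (math.inf, 1000),
]

p95 = histogram_quantile(0.95, SAMPLE_BUCKETS)
print(f"estimated p95: {p95 * 1000:.1f} ms")
```

Because the estimate interpolates inside a bucket, its accuracy depends on how closely the bucket bounds bracket the latencies you care about.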
2. Traces — OpenTelemetry
Every service is wired with the OTel SDK out of the box. Set the exporter endpoint:
OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.your-collector.example
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=…"
OTEL_SERVICE_NAME=elido-edge-redirect

The trace structure for a single redirect:
edge.redirect (4.8ms)
├── cache.lookup (0.6ms) [L1 → L2 → origin fallthrough]
├── rule.evaluate (0.3ms) [smart-link rule eval, if any]
├── response.write (1.0ms)
└── click.publish (0.4ms, async) [redpanda fire-and-forget]

Spans carry the link id, workspace id, region pop, and cache layer that served the response, which is useful for “why is this one tenant slow?” investigations.
3. Logs — structured JSON
Logs go to stdout in JSON Lines format with a stable schema:
{
"ts": "2026-05-08T10:23:14.521Z",
"level": "info",
"service": "edge-redirect",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"msg": "redirect served",
"link_id": "lnk_8a2f",
"workspace_id": 1,
"pop": "fra",
"cache": "l1",
"status": 302,
"latency_ms": 4.8
}

The trace_id field matches the OTel trace ID, so a redirect that looks slow in
Grafana can be followed straight to the corresponding span in your tracing
backend (Honeycomb, in the exporter example above).
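As an illustration of working with this schema, the following Python sketch scans JSON Lines output and returns the trace ids of slow requests, which you could then look up in your tracing backend. The field names come from the example record above; the function name, the 15 ms threshold, and the sample lines are assumptions made for the example.

```python
import json


def slow_requests(lines, threshold_ms):
    """Yield (trace_id, link_id, latency_ms) for records slower than threshold_ms."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise such as startup banners
        if not isinstance(record, dict):
            continue
        latency = record.get("latency_ms")
        if isinstance(latency, (int, float)) and latency > threshold_ms:
            yield record.get("trace_id"), record.get("link_id"), latency


# Hypothetical log lines in the schema shown above.
SAMPLE_LINES = [
    '{"level": "info", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", '
    '"link_id": "lnk_8a2f", "status": 302, "latency_ms": 4.8}',
    "not a json line",
    '{"level": "info", "trace_id": "a1b2c3d4e5f60718293a4b5c6d7e8f90", '
    '"link_id": "lnk_77c1", "status": 302, "latency_ms": 48.2}',
]

for trace_id, link_id, latency in slow_requests(SAMPLE_LINES, threshold_ms=15):
    print(trace_id, link_id, latency)
```

In practice you would pass sys.stdin or an open log file instead of SAMPLE_LINES.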
Pipe to your aggregator with whatever sidecar you prefer:
# Vector → Datadog example
vector --config vector.yaml

4. SLOs (recommended)
We track these in every region; replicate them on your stack if you self-host:
| SLO | Target | Window |
|---|---|---|
| Redirect availability | 99.95% | 30 days rolling |
| Cache-hit p95 | < 15ms | 5 min |
| API-core 5xx rate | < 0.1% | 5 min |
| Click publish success | 99.99% | 5 min |
| Click ingest lag | < 5s | 1 min |
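The elido_redirect_latency_seconds histogram gives you the latency SLI. For
availability, use a ratio over elido_redirect_requests_total, such as
sum(rate(elido_redirect_requests_total{status!~"5.."}[5m])) / sum(rate(elido_redirect_requests_total[5m])).
Count non-5xx responses as good, because a successful redirect returns a 3xx
status (302 in the log example above), not a 2xx. The same ratio drives
burn-rate alerts, so the SLI and the alerts fit in one Grafana panel.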
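The burn-rate arithmetic behind such an alert can be sketched in a few lines of Python. This is a generic illustration of the standard definition (observed error ratio divided by the error budget), not Elido's alerting code; the function name and the request counts are assumptions.

```python
def burn_rate(good, total, slo_target):
    """Return how fast the error budget is being consumed.

    1.0 means the budget would be used up exactly at the end of the SLO
    window; values above 1.0 mean it runs out early.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = (total - good) / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


# Hypothetical 5-minute sample against the 99.95% redirect availability SLO:
# 100 failed requests out of 100,000.
rate = burn_rate(good=99_900, total=100_000, slo_target=0.9995)
print(f"burn rate: {rate:.2f}")
```

Which burn rate should page someone is a policy choice; pick thresholds that match your own window and tolerance.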
5. Health endpoints
Each service exposes:
- /healthz: process up
- /readyz: process can serve traffic (DB / cache reachable)
/readyz returning 503 takes the pod out of rotation in K8s — no
retry logic needed in the load balancer.
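The difference between the two endpoints can be sketched with Python's standard library. This is only an illustration of the liveness versus readiness semantics described above: the real services are written in Go and Node, and the response bodies, the deps_ready callback, and the function names here are assumptions (the actual schemas are in the API reference).

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_response(path, deps_ready):
    """Map a request path to (status, body); deps_ready is a zero-argument callable."""
    if path == "/healthz":
        return 200, "ok"  # the process is up, regardless of dependencies
    if path == "/readyz":
        # Ready only when dependencies such as the DB and cache are reachable.
        return (200, "ready") if deps_ready() else (503, "not ready")
    return 404, "not found"


def make_handler(deps_ready):
    """Build a request handler class bound to a dependency check."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            status, body = health_response(self.path, deps_ready)
            payload = body.encode()
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler


def serve(port, deps_ready):
    """Serve the health endpoints until interrupted."""
    HTTPServer(("127.0.0.1", port), make_handler(deps_ready)).serve_forever()
```

Calling serve(9090, lambda: True) starts the sketch locally; swapping in a callback that returns False makes /readyz answer 503 while /healthz keeps returning 200.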
6. Edge cases
- Cardinality: link_id is high-cardinality. We don’t expose it as a metric label by default; if you need per-link metrics, query ClickHouse instead. Adding it as a label multiplies the series count by the number of links, which can exhaust Prometheus memory.
- Sampling: by default 10% of traces are sampled. Override with OTEL_TRACES_SAMPLER=always_on for debugging; revert before production traffic.
- Self-hosted retention: we don’t ship a metrics retention store; bring your own (Mimir, Cortex, or Prometheus + remote write to anything).
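The cardinality warning is easy to quantify. The Python sketch below computes the worst-case number of time series for one metric as the product of its label cardinalities; the label counts (5 statuses, 3 cache states, one million links) are hypothetical numbers chosen for illustration.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label cardinalities for elido_redirect_requests_total.
default_labels = {"status": 5, "cache": 3}
with_link_id = {**default_labels, "link_id": 1_000_000}

print("default labels:", max_series(default_labels))
print("with link_id:  ", max_series(with_link_id))
```

The product is an upper bound, since not every label combination occurs in practice, but the growth factor from a per-link label is what matters.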
See also
- Self-hosting: Helm chart with Prometheus ServiceMonitors pre-wired
- Architecture: edge-redirect, the hot-path service these metrics describe
- API reference: /healthz and /readyz schemas