
Observability

Every Elido service emits Prometheus metrics on /metrics, OpenTelemetry traces over OTLP, and structured JSON logs to stdout. Nothing is locked behind a vendor — point your existing stack at the standard endpoints and it works.

1. Metrics — Prometheus

Each service exposes a /metrics endpoint on the management port (9090 for the Go services, 9464 for the Node ones). Scrape it with your existing Prometheus setup:

# prometheus.yml
scrape_configs:
  - job_name: elido-edge-redirect
    metrics_path: /metrics
    static_configs:
      - targets:
          - "edge-fra.elido.internal:9090"
          - "edge-ash.elido.internal:9090"
          - "edge-sgp.elido.internal:9090"
  - job_name: elido-api-core
    static_configs:
      - targets: ["api-core.elido.internal:9090"]

The metrics that matter most:

Metric                                          Type        Notes
elido_redirect_requests_total{status,cache}     counter     by HTTP status + cache hit / miss
elido_redirect_latency_seconds                  histogram   hot-path latency, buckets at p50 / p95 / p99
elido_link_cache_size                           gauge       L1 LRU cache occupancy on the edge
elido_click_publish_failed_total                counter     redpanda publish failures (cold-path)
elido_db_query_seconds{query}                   histogram   per-query Postgres latency
elido_clickhouse_insert_seconds                 histogram   ingester batch latency
elido_billing_invoice_total{state}              counter     invoices by paid / past_due / void

Histograms expose _bucket, _sum, _count — Grafana’s histogram_quantile over the bucket series gives you p95 directly.
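
If you want that p95 as a first-class series, a minimal recording rule over the bucket series could look like the sketch below; the rule group and record names are illustrative, not defaults Elido ships:

# prometheus rules file (illustrative sketch)
groups:
  - name: elido-latency
    rules:
      - record: elido:redirect_latency_seconds:p95
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(elido_redirect_latency_seconds_bucket[5m])))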

2. Traces — OpenTelemetry

Every service is wired with the OTel SDK out of the box. Set the exporter endpoint:

OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.your-collector.example
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=…"
OTEL_SERVICE_NAME=elido-edge-redirect
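
On the receiving side, a minimal OpenTelemetry Collector pipeline that accepts OTLP from the services and forwards to Honeycomb might be sketched like this (the exporter name and the HONEYCOMB_API_KEY placeholder are assumptions, not something we ship):

# otel-collector config (illustrative sketch)
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb]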

The trace structure for a single redirect:

edge.redirect (4.8ms)
├── cache.lookup (0.6ms)          [L1 → L2 → origin fallthrough]
├── rule.evaluate (0.3ms)         [smart-link rule eval, if any]
├── response.write (1.0ms)
└── click.publish (0.4ms, async)  [redpanda fire-and-forget]

Spans carry the link id, workspace id, region pop, and cache layer that served the response — useful for “why is this one tenant slow?” investigations.

3. Logs — structured JSON

Logs go to stdout in JSON Lines format with a stable schema:

{ "ts": "2026-05-08T10:23:14.521Z", "level": "info", "service": "edge-redirect", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7", "msg": "redirect served", "link_id": "lnk_8a2f", "workspace_id": 1, "pop": "fra", "cache": "l1", "status": 302, "latency_ms": 4.8 }

trace_id matches the OTel trace, so a click that’s slow in Grafana drills straight into the corresponding span in Honeycomb.

Pipe to your aggregator with whatever sidecar you prefer:

# Vector → Datadog example
vector --config vector.yaml
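
A minimal vector.yaml for that sidecar might look like the sketch below; the source and sink names and the Datadog destination are illustrative, so swap in whatever aggregator you actually run:

# vector.yaml (illustrative sketch): tail pod logs, forward to Datadog
sources:
  elido_logs:
    type: kubernetes_logs
sinks:
  datadog:
    type: datadog_logs
    inputs: ["elido_logs"]
    default_api_key: "${DD_API_KEY}"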

4. SLOs

We track these on every region; replicate them on your stack if you self-host:

SLO                      Target     Window
Redirect availability    99.95%     30 days rolling
Cache-hit p95            < 15ms     5 min
API-core 5xx rate        < 0.1%     5 min
Click publish success    99.99%     5 min
Click ingest lag         < 5s       1 min

The elido_redirect_latency_seconds histogram + a (rate(_count{status=~"2.."}) / rate(_count)) ratio give you both the SLI and burn-rate alerts in one Grafana panel.
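
As a sketch, the same ratio can also drive a fast-burn alert against the 99.95% availability target; the 14.4 burn-rate factor, alert name, and status matcher are illustrative and should be tuned to what you count as success:

# prometheus alerting rule (illustrative fast-burn sketch)
groups:
  - name: elido-redirect-slo
    rules:
      - alert: ElidoRedirectAvailabilityFastBurn
        # error ratio over the last hour, derived from the histogram's _count series;
        # widen the matcher if 3xx redirects count as success in your setup
        expr: |
          (
            1 - (
              sum(rate(elido_redirect_latency_seconds_count{status=~"2.."}[1h]))
              /
              sum(rate(elido_redirect_latency_seconds_count[1h]))
            )
          ) > 14.4 * (1 - 0.9995)
        for: 5m
        labels:
          severity: page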

5. Health endpoints

Each service exposes:

  • /healthz — process up
  • /readyz — process can serve traffic (DB / cache reachable)

/readyz returning 503 takes the pod out of rotation in K8s — no retry logic needed in the load balancer.
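
A Kubernetes probe sketch wiring those two endpoints up; the port and timings are assumptions, so point them at wherever your deployment actually serves the health endpoints:

# pod spec fragment (illustrative sketch)
livenessProbe:
  httpGet:
    path: /healthz
    port: 9090
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  periodSeconds: 5
  failureThreshold: 2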

6. Edge cases

  • Cardinality — link_id is high-cardinality. We don’t expose it as a metric label by default; if you need per-link metrics, query ClickHouse instead. Adding it as a label will blow up your Prometheus.
  • Sampling — by default 10% of traces are sampled. Override with OTEL_TRACES_SAMPLER=always_on for debugging; revert before production traffic.
  • Self-hosted retention — we don’t ship a metrics retention store; bring your own (Mimir, Cortex, or Prometheus + remote write to anything).
