
Observability

Every Elido service emits Prometheus metrics on /metrics, OpenTelemetry traces over OTLP, and structured JSON logs to stdout. Nothing is locked behind a vendor — point your existing stack at the standard endpoints and it works.

1. Metrics — Prometheus

Each service exposes a /metrics endpoint on the management port (9090 for the Go services, 9464 for the Node ones). Scrape it with your existing Prometheus setup:

# prometheus.yml
scrape_configs:
  - job_name: elido-edge-redirect
    metrics_path: /metrics
    static_configs:
      - targets:
          - "edge-fra.elido.internal:9090"
          - "edge-ash.elido.internal:9090"
          - "edge-sgp.elido.internal:9090"
  - job_name: elido-api-core
    static_configs:
      - targets: ["api-core.elido.internal:9090"]

The metrics that matter most:

Metric                                          Type        Notes
elido_redirect_requests_total{status,cache}     counter     by HTTP status + cache hit / miss
elido_redirect_latency_seconds                  histogram   hot-path latency, buckets at p50 / p95 / p99
elido_link_cache_size                           gauge       L1 LRU cache occupancy on the edge
elido_click_publish_failed_total                counter     redpanda publish failures (cold-path)
elido_db_query_seconds{query}                   histogram   per-query Postgres latency
elido_clickhouse_insert_seconds                 histogram   ingester batch latency
elido_billing_invoice_total{state}              counter     invoices by paid / past_due / void

Histograms expose _bucket, _sum, _count — Grafana’s histogram_quantile over the bucket series gives you p95 directly.
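
If you want that p95 as a first-class series, a minimal recording rule over the bucket series could look like the sketch below; the rule group and record names are illustrative, not defaults Elido ships:

# prometheus rules file (illustrative sketch)
groups:
  - name: elido-latency
    rules:
      - record: elido:redirect_latency_seconds:p95
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(elido_redirect_latency_seconds_bucket[5m])))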

2. Traces — OpenTelemetry

Every service is wired with the OTel SDK out of the box. Set the exporter endpoint:

OTEL_EXPORTER_OTLP_ENDPOINT=https://otel.your-collector.example
OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=…"
OTEL_SERVICE_NAME=elido-edge-redirect
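
On the receiving side, a minimal OpenTelemetry Collector pipeline that accepts OTLP from the services and forwards to Honeycomb might be sketched like this (the exporter name and the HONEYCOMB_API_KEY placeholder are assumptions, not something we ship):

# otel-collector config (illustrative sketch)
receivers:
  otlp:
    protocols:
      grpc:
      http:
exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb]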

The trace structure for a single redirect:

edge.redirect (4.8ms)
├── cache.lookup (0.6ms)          [L1 → L2 → origin fallthrough]
├── rule.evaluate (0.3ms)         [smart-link rule eval, if any]
├── response.write (1.0ms)
└── click.publish (0.4ms, async)  [redpanda fire-and-forget]

Spans carry the link id, workspace id, region pop, and cache layer that served the response — useful for “why is this one tenant slow?” investigations.

3. Logs — structured JSON

Logs go to stdout in JSON Lines format with a stable schema:

{ "ts": "2026-05-08T10:23:14.521Z", "level": "info", "service": "edge-redirect", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7", "msg": "redirect served", "link_id": "lnk_8a2f", "workspace_id": 1, "pop": "fra", "cache": "l1", "status": 302, "latency_ms": 4.8 }

trace_id matches the OTel trace, so a click that’s slow in Grafana drills straight into the corresponding span in Honeycomb.

Pipe to your aggregator with whatever sidecar you prefer:

# Vector → Datadog example
vector --config vector.yaml
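
A minimal vector.yaml for that sidecar might look like the sketch below; the source and sink names and the Datadog destination are illustrative, so swap in whatever aggregator you actually run:

# vector.yaml (illustrative sketch): tail pod logs, forward to Datadog
sources:
  elido_logs:
    type: kubernetes_logs
sinks:
  datadog:
    type: datadog_logs
    inputs: ["elido_logs"]
    default_api_key: "${DD_API_KEY}"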

4. SLOs

We track these on every region; replicate them on your stack if you self-host:

SLO                      Target     Window
Redirect availability    99.95%     30 days rolling
Cache-hit p95            < 15ms     5 min
API-core 5xx rate        < 0.1%     5 min
Click publish success    99.99%     5 min
Click ingest lag         < 5s       1 min

The elido_redirect_latency_seconds histogram + a (rate(_count{status=~"2.."}) / rate(_count)) ratio give you both the SLI and burn-rate alerts in one Grafana panel.
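
As a sketch, the same ratio can also drive a fast-burn alert against the 99.95% availability target; the 14.4 burn-rate factor, alert name, and status matcher are illustrative and should be tuned to what you count as success:

# prometheus alerting rule (illustrative fast-burn sketch)
groups:
  - name: elido-redirect-slo
    rules:
      - alert: ElidoRedirectAvailabilityFastBurn
        # error ratio over the last hour, derived from the histogram's _count series;
        # widen the matcher if 3xx redirects count as success in your setup
        expr: |
          (
            1 - (
              sum(rate(elido_redirect_latency_seconds_count{status=~"2.."}[1h]))
              /
              sum(rate(elido_redirect_latency_seconds_count[1h]))
            )
          ) > 14.4 * (1 - 0.9995)
        for: 5m
        labels:
          severity: page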

5. Health endpoints

Each service exposes:

  • /healthz — process up
  • /readyz — process can serve traffic (DB / cache reachable)

/readyz returning 503 takes the pod out of rotation in K8s — no retry logic needed in the load balancer.
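
A Kubernetes probe sketch wiring those two endpoints up; the port and timings are assumptions, so point them at wherever your deployment actually serves the health endpoints:

# pod spec fragment (illustrative sketch)
livenessProbe:
  httpGet:
    path: /healthz
    port: 9090
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /readyz
    port: 9090
  periodSeconds: 5
  failureThreshold: 2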

6. Edge cases

  • Cardinality — link_id is high-cardinality. We don’t expose it as a metric label by default; if you need per-link metrics, query ClickHouse instead. Adding it as a label will blow up your Prometheus.
  • Sampling — by default 10% of traces are sampled. Override with OTEL_TRACES_SAMPLER=always_on for debugging; revert before production traffic.
  • Self-hosted retention — we don’t ship a metrics retention store; bring your own (Mimir, Cortex, or Prometheus + remote write to anything).
