Observability lets you detect regressions and quantify reliability. Here’s a pragmatic approach for teams of any size.
What to log
- Endpoint, latency, status, and points used
- Request IDs and correlation IDs for tracing
- Sampled payloads (1–5%) for quality reviews
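The fields above can be combined into one structured log entry per request. A minimal sketch, assuming a JSON-lines log sink; the helper name and field names are illustrative, not a specific library's API:

```javascript
// Build one JSON-serializable log entry per request.
// Payloads are sampled at ~2% (within the 1-5% range) for quality reviews.
function buildLogEntry({ endpoint, latencyMs, status, pointsUsed,
                         requestId, correlationId, payload }, sampleRoll = Math.random()) {
  const entry = {
    ts: new Date().toISOString(),
    endpoint,
    latencyMs,
    status,
    pointsUsed,
    requestId,      // unique per request
    correlationId,  // shared across a whole trace
  };
  if (sampleRoll < 0.02) entry.payload = payload; // sampled payload capture
  return entry;
}

console.log(JSON.stringify(buildLogEntry({
  endpoint: "/v1/facts", latencyMs: 87, status: 200, pointsUsed: 1,
  requestId: "req-123", correlationId: "corr-456", payload: { q: "cats" },
})));
```

Emitting one JSON object per line keeps logs machine-parseable, so the metrics below can be derived from them directly.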
Metrics to track
- p95 latency by endpoint
- Error rates by code (4xx vs 5xx)
- Request volume and cache hit rate
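If you aggregate latencies yourself rather than in a metrics backend, p95 can be computed with the nearest-rank method. A sketch over an in-memory window; the helper name and sample values are illustrative:

```javascript
// Nearest-rank percentile: sort the window, take the value at ceil(p/100 * n).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Example window of latencies in ms for one endpoint.
const latencies = [120, 80, 95, 300, 110, 90, 105, 85, 100, 250];
console.log(percentile(latencies, 95)); // → 300
```

Tracking p95 per endpoint (rather than a global average) surfaces slow endpoints that a mean would hide.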
Structured error handling
```javascript
try {
  const res = await fetch(url, opts);
  if (!res.ok) {
    // Fall back to statusText in case the error body isn't valid JSON.
    const err = await res.json().catch(() => ({ message: res.statusText }));
    logger.error({ endpoint, status: res.status, err });
    throw new Error(err.message);
  }
} catch (e) {
  // Attach correlation ID (and user ID if available) before rethrowing.
  logger.error({ msg: e.message, correlationId });
  throw e;
}
```
FAQ
How do I set SLOs?
Choose a target (e.g., 99.9% success over 30 days), track the remaining error budget, and alert when the budget burns down much faster than the window elapses.
Which calls should I sample?
Sample high-volume endpoints (facts, jokes) at a low rate; capture low-volume premium calls in full.
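That policy can be expressed as a per-endpoint rate table. A minimal sketch; the endpoint paths and rates are assumptions for illustration:

```javascript
// Per-endpoint sampling rates: low for high-volume endpoints,
// 1.0 (full capture) for low-volume premium calls.
const SAMPLE_RATES = {
  "/v1/facts": 0.01,
  "/v1/jokes": 0.01,
  "/v1/premium": 1.0,
};

// Decide per request; rand is injectable for testing.
function shouldSample(endpoint, rand = Math.random()) {
  const rate = SAMPLE_RATES[endpoint] ?? 0.05; // default for unlisted endpoints
  return rand < rate;
}
```

Keeping the table in config rather than code lets you dial rates up during an incident without a deploy.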