Observability lets you detect regressions and quantify reliability. Here’s a pragmatic approach for teams of any size.
What to log
- Endpoint, latency, status, and points used
- Request IDs and correlation IDs for tracing
- Sampled payloads (1–5%) for quality reviews
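As a concrete sketch, a small logging helper that captures these fields on each call could look like the following (logger is assumed to be a structured logger such as pino, and the field names are illustrative):

// Sketch of a per-request log entry covering the fields above.
// `logger` is assumed to be a structured logger (e.g. pino).
function logApiCall({ endpoint, status, startedAt, pointsUsed, requestId, correlationId, payload }) {
  logger.info({
    endpoint,
    status,
    latencyMs: Date.now() - startedAt, // request latency
    pointsUsed,                        // points consumed by this call
    requestId,                         // unique per request
    correlationId                      // shared across a whole user flow
  })
  // sample roughly 1–5% of payloads for offline quality review
  if (payload !== undefined && Math.random() < 0.02) {
    logger.info({ endpoint, requestId, payload })
  }
}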
Metrics to track
- p95 latency by endpoint
- Error rates by code (4xx vs 5xx)
- Request volume and cache hit rate
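The math behind these numbers is simple enough to sketch. In practice a metrics backend computes them per endpoint, but a rough version over a batch of log entries (shaped like the objects logged above) looks like this:

// Sketch: derive p95 latency, error rate, and volume from logged entries;
// group entries by endpoint before calling this.
function summarize(entries) {
  const latencies = entries.map(e => e.latencyMs).sort((a, b) => a - b)
  return {
    p95LatencyMs: latencies[Math.floor(latencies.length * 0.95)] ?? null,
    errorRate: entries.length ? entries.filter(e => e.status >= 400).length / entries.length : 0,
    volume: entries.length
  }
}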
Structured error handling
try {
  const res = await fetch(url, opts)
  if (!res.ok) {
    // error bodies are not always JSON; fall back to the status text
    const err = await res.json().catch(() => ({ message: res.statusText }))
    logger.error({ endpoint: url, status: res.status, err })
    throw new Error(err.message || `Request failed with status ${res.status}`)
  }
} catch (e) {
  // attach correlation id and user id if available
  logger.error({ msg: e.message, correlationId })
  throw e
}
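The correlationId used above has to be minted somewhere. One lightweight option, assuming Node 16+ or a modern browser where crypto.randomUUID is available, is to create it once per user flow and forward it on every call:

// Sketch: create one correlation id per user flow and send it as a header
// so client-side and server-side logs can be joined. The header name is illustrative.
const correlationId = crypto.randomUUID()

async function apiCall(url, opts = {}) {
  const headers = { ...opts.headers, 'X-Correlation-Id': correlationId }
  const res = await fetch(url, { ...opts, headers })
  logger.info({ endpoint: url, status: res.status, correlationId })
  return res
}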
FAQ
How do I estimate SLOs?
Choose a target (e.g., 99.9% success over 30 days) and track error budgets; alert on fast burn.
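The error-budget arithmetic is small enough to sanity-check in code; the request and failure counts below are made up for illustration:

// Sketch: error-budget math for a 99.9% success SLO over a 30-day window.
const slo = 0.999
const totalRequests = 1_000_000            // observed volume in the window (illustrative)
const failedRequests = 400                 // observed failures (illustrative)

const budget = totalRequests * (1 - slo)   // 1,000 failures allowed in the window
const budgetUsed = failedRequests / budget // 0.4 → 40% of the budget consumed
// A "fast burn" alert fires when the same ratio, computed over a short window
// (say, the last hour), is high enough to exhaust the budget well before the 30 days end.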
Which calls should I sample?
High-volume endpoints (facts, jokes) at low rate; full capture for low-volume premium calls.
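One way to express that policy is a per-endpoint sample-rate map; the paths and rates below are illustrative:

// Sketch: payload sampling policy; endpoints not listed default to full capture.
const sampleRates = { '/v1/facts': 0.01, '/v1/jokes': 0.01 }

function shouldCapturePayload(endpoint) {
  const rate = sampleRates[endpoint] ?? 1  // premium / low-volume calls keep everything
  return Math.random() < rate
}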
