JSON Health Check API: Response Format, Dependencies & Kubernetes Probes


Most tutorials show a health endpoint that returns { "status": "ok" } with HTTP 200, and that is where they stop. In production, that is not enough. A real JSON health check API needs to follow a standard response format so monitoring tools can parse it, report on each dependency individually, integrate with Kubernetes liveness and readiness probes correctly, and expose the right information without leaking internal infrastructure details. The IETF Internet-Draft "Health Check Response Format for HTTP APIs" (draft-inadarei-api-health-check) defines a structured JSON schema using three status values ("pass", "fail", and "warn") that a number of health check libraries follow. Getting the distinction between liveness and readiness probes wrong in Kubernetes is one of the most common causes of unnecessary pod restarts during dependency outages. This guide covers the full picture: IETF response format, per-dependency checks with proper parallelism, Kubernetes probe configuration, Express and Next.js implementation patterns, performance budgets, security considerations, and Prometheus/Datadog integration.

JSON Health Check Response Format

The IETF draft-inadarei-api-health-check defines a standard JSON schema for health check responses. The three status values — "pass", "fail", and "warn" — carry precise semantics. "pass" means all checks succeeded and the service is fully operational. "fail" means the service cannot handle requests; the HTTP status code should be 503. "warn" means the service is operational but degraded — some non-critical dependency is unavailable — and the HTTP status code remains 200. This distinction matters: a monitoring system that treats "warn" as an outage will page you at 2 AM when a non-critical cache layer is slow; one that ignores "warn" entirely will miss genuine degradation. The checks object maps component names to arrays of check results, allowing multiple instances of the same dependency to be reported individually.

// IETF draft-inadarei-api-health-check response format
// Content-Type: application/health+json

// ── Passing response (HTTP 200) ────────────────────────────────────────
{
  "status": "pass",
  "version": "1.4.2",
  "releaseId": "abc123def456",
  "serviceId": "orders-api",
  "description": "Order processing service",
  "checks": {
    "postgres:connection": [
      {
        "status": "pass",
        "componentId": "db-primary",
        "componentType": "datastore",
        "observedValue": 12,
        "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z"
      }
    ],
    "redis:connection": [
      {
        "status": "pass",
        "componentId": "cache-01",
        "componentType": "datastore",
        "observedValue": 3,
        "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z"
      }
    ],
    "stripe:reachability": [
      {
        "status": "pass",
        "componentType": "system",
        "observedValue": 87,
        "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z"
      }
    ]
  }
}

// ── Warning response (HTTP 200) — degraded but operational ────────────
{
  "status": "warn",
  "version": "1.4.2",
  "releaseId": "abc123def456",
  "checks": {
    "postgres:connection": [
      { "status": "pass", "observedValue": 14, "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z" }
    ],
    "redis:connection": [
      {
        "status": "warn",
        "output": "Redis connection timeout after 2000ms — falling back to DB cache",
        "observedValue": 2001,
        "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z"
      }
    ]
  }
}

// ── Failing response (HTTP 503) ───────────────────────────────────────
{
  "status": "fail",
  "version": "1.4.2",
  "releaseId": "abc123def456",
  "output": "Critical dependency postgres is unreachable",
  "checks": {
    "postgres:connection": [
      {
        "status": "fail",
        "output": "Connection refused: ECONNREFUSED 10.0.0.5:5432",
        "time": "2026-02-05T10:00:00.000Z"
      }
    ],
    "redis:connection": [
      { "status": "pass", "observedValue": 4, "observedUnit": "ms",
        "time": "2026-02-05T10:00:00.000Z" }
    ]
  }
}

// ── HTTP status code mapping ──────────────────────────────────────────
// "pass"  →  HTTP 200
// "warn"  →  HTTP 200 (service is operational; monitoring interprets warn)
// "fail"  →  HTTP 503 Service Unavailable

The component name in checks follows the convention componentName:measurementName. Using a structured name like postgres:responseTime allows monitoring tools to parse the component name and the measurement from the key. The array value for each check supports reporting multiple instances (for example, two database replicas, or a primary and a read replica), each with its own status and response time. Return Content-Type: application/health+json per the IETF draft; load balancers and probes that inspect the content type will correctly identify the response format.
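As a rough illustration of how a consumer might use that key convention, the sketch below splits a checks key at its first colon and reduces a per-instance array to its worst status. The helper names parseCheckKey and worstStatus are hypothetical; they are not defined by the draft or by any library.

```typescript
type Status = 'pass' | 'warn' | 'fail'
type CheckKey = { componentName: string; measurementName?: string }

// Split "componentName:measurementName" at the first colon.
// Keys without a colon are treated as a bare component name.
function parseCheckKey(key: string): CheckKey {
  const idx = key.indexOf(':')
  if (idx === -1) return { componentName: key }
  return {
    componentName: key.slice(0, idx),
    measurementName: key.slice(idx + 1),
  }
}

// Rank statuses so the most severe one wins when several instances
// (for example a primary and a replica) are reported under one key.
const severity: Record<Status, number> = { pass: 0, warn: 1, fail: 2 }

function worstStatus(instances: { status: Status }[]): Status {
  return instances.reduce<Status>(
    (worst, i) => (severity[i.status] > severity[worst] ? i.status : worst),
    'pass',
  )
}
```

A dashboard could call parseCheckKey on each key to group series by component, and worstStatus to show one indicator per component even when several instances are listed.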

Checking Dependencies in Health Responses

A health check that only reports the process status is only half useful. The value of the checks object is reporting the status of each external dependency individually so operations teams and automated systems can pinpoint exactly which component is failing. Four categories of dependency checks cover most services: database connectivity, cache connectivity, external API reachability, and message queue depth. Each check should measure and report its own response time (observedValue with observedUnit "ms" in the response format above): latency spikes on a database connection are an early warning of an impending failure even when the check still passes.

// Dependency check implementations

// ── PostgreSQL ping ───────────────────────────────────────────────────
import { Pool } from 'pg'

async function checkPostgres(pool: Pool): Promise<CheckResult> {
  const start = Date.now()
  try {
    await Promise.race([
      pool.query('SELECT 1'),
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), 2000)
      ),
    ])
    return {
      status: 'pass',
      componentType: 'datastore',
      observedValue: Date.now() - start,
      observedUnit: 'ms',
      time: new Date().toISOString(),
    }
  } catch (err) {
    return {
      status: 'fail',
      output: (err as Error).message,
      time: new Date().toISOString(),
    }
  }
}

// ── Redis ping ─────────────────────────────────────────────────────────
import { createClient } from 'redis'

async function checkRedis(client: ReturnType<typeof createClient>): Promise<CheckResult> {
  const start = Date.now()
  try {
    await Promise.race([
      client.ping(),                  // returns 'PONG'
      new Promise((_, reject) =>
        setTimeout(() => reject(new Error('Timeout')), 2000)
      ),
    ])
    return {
      status: 'pass',
      componentType: 'datastore',
      observedValue: Date.now() - start,
      observedUnit: 'ms',
      time: new Date().toISOString(),
    }
  } catch (err) {
    return {
      status: 'warn',              // Redis unavailable: degraded, not failed
      output: (err as Error).message,
      time: new Date().toISOString(),
    }
  }
}

// ── External HTTP API reachability ────────────────────────────────────
async function checkExternalApi(url: string, name: string): Promise<CheckResult> {
  const start = Date.now()
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), 3000)

  try {
    const res = await fetch(url, { method: 'HEAD', signal: controller.signal })
    clearTimeout(timer)
    return {
      status: res.ok ? 'pass' : 'warn',
      componentType: 'system',
      observedValue: Date.now() - start,
      observedUnit: 'ms',
      time: new Date().toISOString(),
    }
  } catch (err) {
    clearTimeout(timer)
    return {
      status: 'warn',
      output: `${name} unreachable: ${(err as Error).message}`,
      time: new Date().toISOString(),
    }
  }
}

// ── Run all checks in parallel with Promise.allSettled ────────────────
// IMPORTANT: Never use Promise.all — one rejection would prevent all
// other check results from being collected.

type CheckResult = {
  status: 'pass' | 'fail' | 'warn'
  componentType?: string
  componentId?: string
  observedValue?: number
  observedUnit?: string
  output?: string
  time: string
}

export async function runAllChecks(deps: {
  pgPool: Pool
  redisClient: ReturnType<typeof createClient>
}) {
  const [pgResult, redisResult, stripeResult] = await Promise.allSettled([
    checkPostgres(deps.pgPool),
    checkRedis(deps.redisClient),
    checkExternalApi('https://api.stripe.com/', 'stripe'),
  ])

  const resolve = (r: PromiseSettledResult<CheckResult>): CheckResult =>
    r.status === 'fulfilled'
      ? r.value
      : { status: 'fail', output: r.reason?.message, time: new Date().toISOString() }

  const checks = {
    'postgres:connection': [resolve(pgResult)],
    'redis:connection':    [resolve(redisResult)],
    'stripe:reachability': [resolve(stripeResult)],
  }

  // Aggregate: fail if any critical check fails; warn if any other check is not passing
  // (a non-critical check can report 'fail' via the rejected-promise fallback above)
  const critical = checks['postgres:connection']
  const aggregateStatus =
    critical.some(c => c.status === 'fail')    ? 'fail' :
    Object.values(checks).flat().some(c => c.status !== 'pass') ? 'warn' : 'pass'

  return { aggregateStatus, checks }
}

Use Promise.allSettled instead of Promise.all; this is critical. Promise.all rejects immediately on the first rejection, so if your Redis check rejects, you never collect the PostgreSQL or Stripe results. Promise.allSettled always resolves with the outcome of every check, fulfilled or rejected. The check functions above catch their own errors and return a result object, so allSettled mainly acts as a safety net for anything that still throws. Classify each dependency as critical or non-critical at configuration time: a critical check failure drives the aggregate status to "fail" (HTTP 503), while a non-critical check failure drives it to "warn" (HTTP 200).
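One way to express that configuration-time classification is to keep the set of critical check keys separate from the aggregation logic. The sketch below is a minimal version under that assumption; the criticalChecks set and the aggregate function are hypothetical names, and the rule that a critical "warn" only degrades the service is a design choice rather than something the draft prescribes.

```typescript
type Status = 'pass' | 'warn' | 'fail'
type CheckResult = { status: Status; output?: string; time: string }

// Hypothetical configuration: check keys whose failure makes the service unusable
const criticalChecks: ReadonlySet<string> = new Set(['postgres:connection'])

function aggregate(
  checks: Record<string, CheckResult[]>,
  critical: ReadonlySet<string>,
): Status {
  let degraded = false
  for (const [name, results] of Object.entries(checks)) {
    for (const r of results) {
      if (r.status === 'pass') continue
      // A failing critical dependency fails the whole service
      if (r.status === 'fail' && critical.has(name)) return 'fail'
      // Anything else that is not passing only degrades it
      degraded = true
    }
  }
  return degraded ? 'warn' : 'pass'
}
```

Because the critical set is plain data, it could be loaded from configuration per environment without touching the check functions themselves.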

Liveness vs Readiness vs Startup Probes

Kubernetes defines three probe types with fundamentally different behaviors on failure. Getting this distinction wrong is one of the most common causes of unnecessary pod restarts during dependency outages. The key rule: liveness probes should never check external dependencies. If your database goes down, killing and restarting your application pod will not fix the database — it will cause a restart storm that amplifies the incident and may trigger a CrashLoopBackOff state, making recovery slower.

# ── Kubernetes probe configuration ────────────────────────────────────
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  template:
    spec:
      containers:
        - name: orders-api
          image: orders-api:1.4.2
          ports:
            - containerPort: 3000

          # LIVENESS PROBE — "Is the process alive?"
          # Failure: kubelet KILLS and RESTARTS the container
          # Rule: NEVER check external dependencies here.
          #        Only verify the event loop responds.
          livenessProbe:
            httpGet:
              path: /health/live         # returns {"status":"pass"} only
              port: 3000
            initialDelaySeconds: 10      # wait for app to start before checking
            periodSeconds: 10            # check every 10 seconds
            timeoutSeconds: 1            # must respond in < 1s
            failureThreshold: 3          # 3 consecutive failures before restart
            successThreshold: 1

          # READINESS PROBE — "Can this pod serve traffic?"
          # Failure: kubelet REMOVES pod from Service endpoints (no restart)
          # Rule: Check all critical dependencies here.
          #        Pod won't receive requests until ready.
          readinessProbe:
            httpGet:
              path: /health/ready        # full dependency check
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3            # up to 3s for dependency checks
            failureThreshold: 3          # 3 consecutive failures before removing from LB
            successThreshold: 2          # 2 successes needed to re-add to LB

          # STARTUP PROBE — "Has the app finished initializing?"
          # Disables liveness/readiness until startup succeeds.
          # Useful for apps with slow initialization (DB migrations, etc.)
          startupProbe:
            httpGet:
              path: /health/live
              port: 3000
            failureThreshold: 30         # allow up to 30 * 10s = 5 min to start
            periodSeconds: 10

---
# ── What each probe failure does ──────────────────────────────────────
# livenessProbe failure:
#   → Container is KILLED → restartPolicy applied (usually Always)
#   → Generates a "Killing container" event in kubectl describe pod
#   → Pod stays on the same node, same IP (unless rescheduled)
#
# readinessProbe failure:
#   → Pod REMOVED from Service endpoints → no new traffic routed to it
#   → Existing connections finish normally
#   → Pod is NOT killed — it can recover and re-enter the LB pool
#   → kubectl get pods shows READY 0/1
#
# startupProbe failure (all failureThreshold attempts):
#   → Container is KILLED → same as liveness failure
#   → Prevents liveness/readiness from firing during startup

---
# ── Per-probe endpoint recommendations ────────────────────────────────
# GET /health/live    → {"status":"pass"}           always fast, no deps
# GET /health/ready   → full IETF health check      with dependency checks
# GET /health         → same as /health/ready       for external monitoring

Set timeoutSeconds on the liveness probe to 1 second — if your liveness endpoint takes longer than 1 second to respond, something is seriously wrong with the process (not a dependency). Set timeoutSeconds on readiness to 3–5 seconds to give dependency checks time to complete. The failureThreshold on readiness should be at least 3 — single probe failures can be transient network blips. For readiness, successThreshold: 2 means the pod needs two consecutive passing checks before re-entering the load balancer pool, reducing flapping during recovery.
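To see what these settings mean in wall-clock terms, the sketch below multiplies the probe period by the relevant threshold. It is an approximation that ignores timeoutSeconds, initialDelaySeconds, and where in the period a failure begins; ProbeTiming and both function names are hypothetical.

```typescript
// Hypothetical subset of the probe fields used in the manifest above
type ProbeTiming = {
  periodSeconds: number
  failureThreshold: number
  successThreshold?: number // Kubernetes defaults this to 1
}

// Approximate seconds of continuous failure before kubelet acts
// (restart for liveness/startup, endpoint removal for readiness)
function secondsToFailureAction(p: ProbeTiming): number {
  return p.periodSeconds * p.failureThreshold
}

// Approximate seconds of continuous success before the probe
// is considered passing again
function secondsToRecovery(p: ProbeTiming): number {
  return p.periodSeconds * (p.successThreshold ?? 1)
}

const liveness: ProbeTiming = { periodSeconds: 10, failureThreshold: 3 }
const readiness: ProbeTiming = { periodSeconds: 10, failureThreshold: 3, successThreshold: 2 }
const startup: ProbeTiming = { periodSeconds: 10, failureThreshold: 30 }
```

With the values from the manifest, secondsToFailureAction(startup) reproduces the 30 * 10s startup allowance, and secondsToRecovery(readiness) shows how long a recovered pod waits before receiving traffic again.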

Implementing Health Checks in Express and Next.js

Both Express and Next.js App Router support health check routes. The implementation pattern is the same: a liveness route that returns immediately, a readiness route that runs all dependency checks in parallel, and a shared health check module that manages caching and result aggregation. The dependency check module should be initialized once at application startup with references to existing connection pools — creating new connections on each health check request is an anti-pattern that adds latency and exhausts connection limits under probe-heavy load.

// ── Express implementation ────────────────────────────────────────────
import express from 'express'
import { runAllChecks } from './health/checks'

const app = express()

// Liveness: event loop alive, no dependencies
app.get('/health/live', (_req, res) => {
  res.status(200).json({ status: 'pass' })
})

// Readiness: full dependency check
app.get('/health/ready', async (_req, res) => {
  const { aggregateStatus, checks } = await runAllChecks({
    pgPool,
    redisClient,
  })

  const httpStatus = aggregateStatus === 'fail' ? 503 : 200

  res.status(httpStatus)
     .set('Content-Type', 'application/health+json')
     .json({
       status: aggregateStatus,
       version: process.env.npm_package_version ?? 'unknown',
       releaseId: process.env.RELEASE_ID ?? 'unknown',
       checks,
     })
})

// ── Next.js App Router implementation ─────────────────────────────────
// app/api/health/live/route.ts
import { NextResponse } from 'next/server'

export const dynamic = 'force-dynamic'   // never cache liveness route

export async function GET() {
  return NextResponse.json(
    { status: 'pass' },
    { status: 200, headers: { 'Content-Type': 'application/health+json' } }
  )
}

// app/api/health/ready/route.ts
import { NextResponse } from 'next/server'
import { runAllChecks } from '@/lib/health/checks'
import { getCachedHealth, setCachedHealth } from '@/lib/health/cache'

export const dynamic = 'force-dynamic'

export async function GET() {
  // Return cached result if fresh (5-second TTL)
  const cached = getCachedHealth()
  if (cached) {
    const httpStatus = cached.status === 'fail' ? 503 : 200
    return NextResponse.json(cached, {
      status: httpStatus,
      headers: { 'Content-Type': 'application/health+json' },
    })
  }

  // pgPool and redisClient are assumed to be shared module-level instances
  const { aggregateStatus, checks } = await runAllChecks({ pgPool, redisClient })

  const body = {
    status: aggregateStatus,
    version: process.env.npm_package_version ?? 'unknown',
    releaseId: process.env.RELEASE_ID ?? 'unknown',
    checks,
  }

  setCachedHealth(body)

  const httpStatus = aggregateStatus === 'fail' ? 503 : 200
  return NextResponse.json(body, {
    status: httpStatus,
    headers: { 'Content-Type': 'application/health+json' },
  })
}

// ── Health check cache (lib/health/cache.ts) ──────────────────────────
type HealthBody = {
  status: 'pass' | 'fail' | 'warn'
  version: string
  releaseId: string
  checks: Record<string, unknown[]>
}

let cache: { body: HealthBody; expiresAt: number } | null = null
const TTL_MS = 5_000   // 5-second cache

export function getCachedHealth(): HealthBody | null {
  if (cache && Date.now() < cache.expiresAt) return cache.body
  return null
}

export function setCachedHealth(body: HealthBody): void {
  cache = { body, expiresAt: Date.now() + TTL_MS }
}

In Next.js, set export const dynamic = 'force-dynamic' on health routes to prevent the response from being statically rendered or cached at the framework level. The 5-second in-memory cache in the route handler is separate — it prevents redundant dependency checks when probes fire simultaneously, but it never serves a stale response from the CDN or Next.js static cache. In Express, initialize connection pools before starting the HTTP server so the first readiness probe does not fail due to connections still being established.

Performance Considerations for Health Endpoints

Health endpoints are called far more frequently than most application routes. A Kubernetes cluster with 10 pods, liveness probes every 10 seconds, readiness probes every 10 seconds, and a Prometheus scrape every 15 seconds generates roughly 160 health check requests per minute (6 + 6 + 4 per pod, times 10 pods) before external monitoring tools are added. Poor health check performance can paradoxically cause the health check itself to fail the probe timeout, marking healthy pods as unhealthy. Three strategies address this: lightweight ping operations, parallel execution with Promise.allSettled, and short-lived result caching.
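The request volume in a scenario like the one above can be estimated with a small helper. This is only a sketch: requestsPerMinute is a hypothetical name, and it assumes every pod receives each probe and scrape independently at a fixed interval.

```typescript
// Estimate health-related requests per minute across a deployment.
// intervalsSeconds lists the period of each caller that hits every pod
// (liveness probe, readiness probe, metrics scrape, and so on).
function requestsPerMinute(pods: number, intervalsSeconds: number[]): number {
  const perPod = intervalsSeconds.reduce((sum, s) => sum + 60 / s, 0)
  return pods * perPod
}

// Scenario from the text: 10 pods, two probes every 10s, one scrape every 15s
const estimated = requestsPerMinute(10, [10, 10, 15])
```

Adding an external uptime monitor or a second scraper is then a matter of appending another interval to the list.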

// ── Performance: lightweight pings only ──────────────────────────────

// BAD: application query used as health check
// This holds a DB connection, runs query planning, and scans the whole table
async function checkPostgresBad(pool: Pool) {
  await pool.query('SELECT count(*) FROM orders')  // ← NEVER do this
}

// GOOD: minimal ping — tests connectivity only
async function checkPostgresGood(pool: Pool) {
  await pool.query('SELECT 1')  // fastest possible query, no table access
}

// ── Performance: parallel execution ──────────────────────────────────
// Serial (bad): total time = sum of all check times
async function runChecksBad() {
  const pg = await checkPostgres(pool)    // 12ms
  const rd = await checkRedis(client)     // 4ms
  const st = await checkStripe()          // 87ms
  // Total: ~103ms
}

// Parallel (good): total time = max of all check times
async function runChecksGood() {
  const [pg, rd, st] = await Promise.allSettled([
    checkPostgres(pool),     // 12ms
    checkRedis(client),      // 4ms
    checkStripe(),           // 87ms
  ])
  // Total: ~87ms (limited by slowest check)
}

// ── Performance: result caching with atomic update ────────────────────
// Problem: 10 probes fire simultaneously → 10 concurrent dependency check rounds
// Solution: in-flight deduplication + short TTL cache

let inflightPromise: Promise<HealthResult> | null = null

async function getCachedOrFreshHealth(): Promise<HealthResult> {
  const cached = getCachedHealth()
  if (cached) return cached

  // Deduplicate concurrent requests: return same in-flight promise
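  if (!inflightPromise) {
    inflightPromise = runAllChecks({ pgPool, redisClient })
      .then(result => {
        setCachedHealth(result)
        return result
      })
      // Always clear the slot, even on rejection, so a failed round is retried
      .finally(() => { inflightPromise = null })
  }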

  return inflightPromise
}

// ── Response time budgets ─────────────────────────────────────────────
// Liveness  → must complete in < 100ms (even 50ms budget preferred)
//             Kubernetes timeoutSeconds: 1
//
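// Readiness → must complete in < 500ms total for all checks
//             Individual check timeouts: 300ms each (parallel, so 300ms wall clock)
//             (the 2000ms and 3000ms timeouts in the earlier examples would need
//             tightening to fit this budget)
//             Kubernetes timeoutSeconds: 3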
//
// If your fastest dependency check (SELECT 1) takes > 100ms,
// that is itself a signal worth alerting on — not hiding in health.

// ── Separate connection pools for health checks ────────────────────────
// Health check queries should not compete with application queries
// for connection pool slots. Use a dedicated single-connection pool:
const healthPool = new Pool({
  ...dbConfig,
  max: 1,               // never take more than 1 connection
  connectionTimeoutMillis: 1000,
  idleTimeoutMillis: 30_000,
})

Using a dedicated single-connection database pool for health checks prevents health checks from starving application queries of pool connections under load. With a shared pool at maximum capacity, a health check SELECT 1 would queue behind application queries, causing the health check to time out and the pod to be marked unhealthy — exactly the wrong behavior when the database is actually healthy but busy. The 5-second result cache is the single highest-impact performance improvement: it turns N concurrent probe calls into one dependency check round per 5 seconds.

Health Check Security

Health endpoints are frequently left unauthenticated — load balancers, Kubernetes probes, and external uptime monitors all need to reach them without credentials. This convenience creates an information disclosure surface if the endpoint reveals sensitive internal details. The key principle is to differentiate between what unauthenticated external callers need (aggregate status) and what authenticated internal systems need (full dependency diagnostics with response times and component identifiers).

// ── Public health endpoint: aggregate status only ─────────────────────
// Reachable from load balancers, external monitors, Kubernetes probes
// Never exposes dependency names, internal IPs, stack traces

app.get('/health', async (_req, res) => {
  const { aggregateStatus } = await getCachedOrFreshHealth()
  res.status(aggregateStatus === 'fail' ? 503 : 200)
     .set('Content-Type', 'application/health+json')
     .json({
       status: aggregateStatus,
       version: process.env.npm_package_version,
       // NO "checks" object — hides internal topology
     })
})

// ── Internal health endpoint: full diagnostics ─────────────────────────
// Protected by network policy (cluster-internal) OR shared secret header
// Used by operations dashboards, Prometheus, Datadog

app.get('/health/detailed', requireInternalAuth, async (_req, res) => {
  const result = await getCachedOrFreshHealth()
  res.status(result.aggregateStatus === 'fail' ? 503 : 200)
     .set('Content-Type', 'application/health+json')
     .json(result)    // full response with checks, response times, component IDs
})

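import type { Request, Response, NextFunction } from 'express'

function requireInternalAuth(req: Request, res: Response, next: NextFunction) {
  const expected = process.env.INTERNAL_HEALTH_TOKEN
  const token = req.headers['x-internal-token']
  // Fail closed: if the secret is unset, a missing header would otherwise
  // compare undefined to undefined and be let through
  if (!expected || token !== expected) {
    return res.status(403).json({ error: 'Forbidden' })
  }
  next()
}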

// ── What NOT to expose in public health responses ─────────────────────

// BAD: exposes internal hostname, port, and DB name
{
  "checks": {
    "postgres:connection": [{
      "status": "fail",
      "output": "Connection refused: ECONNREFUSED db-prod-01.internal:5432/orders_db"
    }]
  }
}

// GOOD: generic component identifier only
{
  "checks": {
    "database:connection": [{
      "status": "fail",
      "output": "Primary database unreachable"
      // no hostname, no port, no database name
    }]
  }
}

// ── Security checklist ─────────────────────────────────────────────────
// ✗ Never include: connection strings, internal hostnames, IP addresses
// ✗ Never include: API keys, tokens, credentials of any kind
// ✗ Never include: stack traces or error messages with library versions
// ✗ Never include: file paths from the server filesystem
// ✗ Never include: environment variable names that reveal configuration
// ✓ Safe to include: aggregate status, version, releaseId
// ✓ Safe in internal health: component names, response times, error summaries
// ✓ Safe in internal health: ISO 8601 timestamps per check

Protect the detailed health endpoint with a Kubernetes NetworkPolicy that only allows access from within the cluster namespace, combined with a shared secret header for defense in depth. Do not rely solely on network policy — operators accessing the cluster via port-forwarding bypass it. Use separate routes (/health for public, /health/detailed for internal) rather than a query parameter toggle (/health?full=true) — a parameter-based approach is easier to accidentally misconfigure and expose.
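One way to keep the public route from ever leaking diagnostics is to build its body from an allow-list rather than deleting fields from the detailed response. The sketch below assumes a detailed body shaped like the IETF examples earlier; toPublicHealth and both type names are hypothetical.

```typescript
type Status = 'pass' | 'warn' | 'fail'

type DetailedHealth = {
  status: Status
  version?: string
  releaseId?: string
  output?: string
  checks?: Record<string, unknown[]>
}

type PublicHealth = {
  status: Status
  version?: string
  releaseId?: string
}

// Allow-list: copy only the fields considered safe for unauthenticated
// callers. Anything added to DetailedHealth later stays private by default.
function toPublicHealth(full: DetailedHealth): PublicHealth {
  const { status, version, releaseId } = full
  return { status, version, releaseId }
}
```

The public /health handler could pass its result through toPublicHealth before serializing, so a new diagnostic field only becomes public if someone adds it to the allow-list on purpose.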

Monitoring and Alerting from Health Checks

Health endpoints are the primary source of truth for service availability in most monitoring setups. Prometheus, Datadog, and similar tools can scrape your health endpoint and convert the JSON response into time-series metrics. The key shift from binary (up/down) to dimensional monitoring is using the per-component checks data to create separate metric series for each dependency — so a database latency spike is visible as a separate signal from a payment API timeout.

// ── Prometheus: expose /metrics with per-component health gauges ──────
import { register, Gauge } from 'prom-client'

const healthStatus = new Gauge({
  name: 'service_health_status',
  help: 'Health check status: 1=pass, 0.5=warn, 0=fail',
  labelNames: ['component'],
})

const healthResponseTime = new Gauge({
  name: 'service_health_response_time_ms',
  help: 'Health check response time in milliseconds',
  labelNames: ['component'],
})

// Run checks and update Prometheus metrics
async function updateHealthMetrics() {
  const { checks } = await runAllChecks({ pgPool, redisClient })

  for (const [component, results] of Object.entries(checks)) {
    const result = results[0] as { status: string; observedValue?: number }
    const statusValue =
      result.status === 'pass' ? 1 :
      result.status === 'warn' ? 0.5 : 0

    healthStatus.set({ component }, statusValue)
    if (result.observedValue !== undefined) {
      healthResponseTime.set({ component }, result.observedValue)
    }
  }
}

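// Update metrics every 15 seconds (independent of probe calls);
// catch rejections so a failed round is logged instead of going unhandled
setInterval(() => {
  updateHealthMetrics().catch(err => console.error('health metrics update failed', err))
}, 15_000)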

// Expose Prometheus scrape endpoint
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType)
  res.end(await register.metrics())
})

// ── Prometheus alerting rules (rules file loaded via rule_files in prometheus.yml) ──
// Alert when any component fails for more than 1 minute
/*
groups:
  - name: health_check_alerts
    rules:
      - alert: ServiceHealthFail
        expr: service_health_status{component="postgres:connection"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL health check failing for {{ $labels.component }}"

      - alert: ServiceHealthWarn
        expr: service_health_status < 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Service health degraded for {{ $labels.component }}"

      - alert: HealthCheckSlow
        expr: service_health_response_time_ms{component="postgres:connection"} > 200
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Database health check latency elevated: {{ $value }}ms"
*/

// ── Datadog synthetic test (push via API) ─────────────────────────────
// Use Datadog Synthetics HTTP test to scrape /health
// Assert: response body contains '"status":"pass"'
// Assert: response time < 500ms
// Alert: when test fails for 2 consecutive checks from 2 locations

// ── Datadog Agent HTTP check (conf.d/http_check.d/conf.yaml) ──────────
/*
init_config:

instances:
  - name: orders-api-health
    url: http://orders-api.default.svc.cluster.local:3000/health
    timeout: 3
    content_match: '"status":"pass"'
    collect_response_time: true
    tags:
      - service:orders-api
      - env:production
*/

Track the observedValue (response time) for each dependency check as a time-series metric — latency trends are leading indicators of failures. A PostgreSQL health check that takes 12ms on Monday but 180ms on Wednesday signals a performance regression before connections start timing out. Set alert thresholds at the for: 2m level (not immediately) to avoid paging on transient blips; most real outages persist longer than two minutes. For SLA reporting, compute the percentage of health check intervals where the aggregate status was "pass" over a rolling 30-day window.
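The SLA figure described above can be approximated from stored health samples. The sketch below is a minimal version that assumes samples are taken at a regular interval and that only "pass" counts as available; Sample and availabilityPercent are hypothetical names, and whether "warn" counts is a contractual decision this code does not settle.

```typescript
type Status = 'pass' | 'warn' | 'fail'
type Sample = { timestampMs: number; status: Status }

const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000

// Percentage of samples inside the rolling window whose aggregate
// status was "pass". Returns null when the window holds no samples.
function availabilityPercent(
  samples: Sample[],
  nowMs: number,
  windowMs: number = THIRTY_DAYS_MS,
): number | null {
  const inWindow = samples.filter(
    s => s.timestampMs > nowMs - windowMs && s.timestampMs <= nowMs,
  )
  if (inWindow.length === 0) return null
  const passing = inWindow.filter(s => s.status === 'pass').length
  return (passing / inWindow.length) * 100
}
```

In practice the same ratio would usually be computed in the monitoring system from the stored gauge values rather than in application code; the function only makes the arithmetic explicit.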

Key Terms

Liveness Probe
A Kubernetes probe that answers the question "Is this container still alive and not deadlocked?" Kubelet runs the probe on a configured interval; if the probe fails failureThreshold consecutive times, kubelet kills the container and restarts it according to the pod's restartPolicy. Liveness probes must never check external dependencies like databases or caches. If a dependency goes down, restarting your application pod will not fix the dependency, but will cause a restart storm that amplifies the incident. A correct liveness probe checks only that the application process is responding — for example, returning HTTP 200 from a route that does nothing except return {"status": "pass"}. Response time budget: under 100ms.
Readiness Probe
A Kubernetes probe that answers the question "Is this pod ready to serve traffic?" If a readiness probe fails, Kubernetes removes the pod from the Service's endpoint list — the load balancer stops routing new requests to it — but does NOT kill or restart the container. This is the correct behavior when a dependency like a database is temporarily unavailable: the pod stops receiving traffic until the dependency recovers, then re-enters the load balancer pool automatically. Readiness probes should check all critical dependencies using lightweight ping operations. Response time budget: under 500ms total for all checks run in parallel.
Startup Probe
A Kubernetes probe that answers the question "Has the application finished initializing?" Until the startup probe succeeds, liveness and readiness probes are disabled — preventing kubelet from killing a slow-starting container before it has had time to initialize. This is important for applications that run database migrations, warm up caches, or load large models on startup. Configure failureThreshold and periodSeconds to cover the maximum expected startup time: if startup takes up to 5 minutes, set failureThreshold: 30 and periodSeconds: 10. Once the startup probe passes, normal liveness and readiness probing begins.
Health Check
An HTTP endpoint that reports the operational status of a service and its dependencies. A minimal health check returns only an HTTP status code (200 or 503). A production-grade JSON health check follows the IETF draft-inadarei-api-health-check format: it returns a JSON body with a top-level status field ("pass", "fail", or "warn"), a version field, a releaseId field, and a checks object reporting the status and response time of each external dependency individually. Content-Type should be application/health+json.
Dependency Check
A lightweight probe of an external service that the application depends on, executed as part of a readiness health check. Each dependency check should use the minimal possible operation: SELECT 1 for PostgreSQL, PING for Redis, a HEAD request for HTTP APIs. Each check records its own responseTime in milliseconds and a status of "pass", "warn", or "fail". All dependency checks should run in parallel using Promise.allSettled — never serially, and never using Promise.all which stops collecting results on the first failure.
Status (pass/fail/warn)
The three status values defined by the IETF draft-inadarei-api-health-check specification. "pass" means the service and all checked dependencies are fully operational; the HTTP response code is 200. "fail" means the service is down or unable to handle requests due to a critical dependency failure; the HTTP response code is 503. "warn" means the service can still handle requests but is in a degraded state — typically a non-critical dependency (cache, feature flag service, analytics pipeline) is unavailable; the HTTP response code remains 200. The "warn" status enables monitoring systems to distinguish degraded-but-operational from fully-failed, reducing alert fatigue while still surfacing problems.
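One way to turn per-dependency results into an aggregate status and HTTP code is sketched below. The DependencyResult shape, the critical flag, and the function names are assumptions made for illustration; the draft itself does not prescribe how to aggregate.

```typescript
type Status = "pass" | "warn" | "fail";

// Hypothetical shape: each dependency is tagged critical or non-critical up front.
interface DependencyResult {
  name: string;
  status: Status;
  critical: boolean;
}

function aggregateStatus(results: DependencyResult[]): Status {
  let aggregate: Status = "pass";
  for (const result of results) {
    if (result.status === "pass") continue;
    // A failed critical dependency means the service cannot handle requests.
    if (result.critical && result.status === "fail") return "fail";
    // Anything else that is not "pass" degrades the service without failing it.
    aggregate = "warn";
  }
  return aggregate;
}

function httpStatusFor(status: Status): 200 | 503 {
  return status === "fail" ? 503 : 200;
}

const sampleResults: DependencyResult[] = [
  { name: "postgres", status: "pass", critical: true },
  { name: "redis", status: "fail", critical: false },
];
const overall = aggregateStatus(sampleResults);
console.log(overall, httpStatusFor(overall)); // warn 200
```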
SLA (Service Level Agreement)
A contractual commitment to a minimum level of service availability, typically expressed as a percentage uptime over a rolling time window (e.g., 99.9% uptime per month allows roughly 43.8 minutes of downtime in an average-length month). Health check endpoints are the primary data source for SLA measurement: the percentage of health check intervals where the aggregate status was "pass" approximates service availability. Prometheus and Datadog both support SLA dashboards built on health check metrics. A change in health check status from "pass" to "fail" starts an outage window; "warn" may or may not count against SLA depending on the contractual definition.
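The downtime figure and the interval-based availability estimate can be worked through in a few lines. This sketch assumes an average month of 30.44 days and treats only "pass" intervals as available; both function names are hypothetical, and whether "warn" counts against the SLA is a contractual choice.

```typescript
type Status = "pass" | "warn" | "fail";

// Downtime allowed by an SLA target over a window, in minutes.
function downtimeBudgetMinutes(slaPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - slaPercent / 100);
}

// Approximate availability as the share of sampled intervals that were "pass".
function availabilityPercent(intervals: Status[]): number {
  if (intervals.length === 0) return 100;
  const passing = intervals.filter((s) => s === "pass").length;
  return (passing / intervals.length) * 100;
}

console.log(downtimeBudgetMinutes(99.9, 30.44).toFixed(1)); // "43.8"
console.log(availabilityPercent(["pass", "pass", "warn", "fail"])); // 50
```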
Promise.allSettled
A JavaScript/TypeScript built-in that takes an array of Promises and returns a single Promise that resolves when all input Promises have settled, whether fulfilled or rejected. Unlike Promise.all, which short-circuits on the first rejection, Promise.allSettled always collects every result. Each element of the resolved array is either { status: 'fulfilled', value: T } or { status: 'rejected', reason }, where reason is whatever value the Promise rejected with (usually an Error). In health check implementations, this is essential: if a Redis check rejects (connection timeout), Promise.allSettled still collects the PostgreSQL and Stripe results, allowing the health response to report all dependency statuses accurately rather than failing entirely.
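A minimal sketch of that behavior follows, using stand-in check functions rather than real database or Redis clients. collectChecks and its result shape are hypothetical names chosen for this example.

```typescript
interface SettledCheck {
  status: "pass" | "fail";
  output?: string;
}

// Runs every check in parallel and reports each one, even if some reject.
async function collectChecks(
  checks: Record<string, () => Promise<unknown>>,
): Promise<Record<string, SettledCheck>> {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(names.map((name) => checks[name]()));
  const report: Record<string, SettledCheck> = {};
  settled.forEach((outcome, index) => {
    report[names[index]] =
      outcome.status === "fulfilled"
        ? { status: "pass" }
        : { status: "fail", output: String(outcome.reason) };
  });
  return report;
}

// Stand-ins for real dependency pings (assumed for illustration only).
collectChecks({
  postgres: async () => "ok",
  redis: async () => {
    throw new Error("connection timeout");
  },
}).then((report) => console.log(report));
```

With Promise.all in place of Promise.allSettled, the Redis rejection would reject the whole call and the PostgreSQL result would be lost.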

FAQ

What should a JSON health check API response look like?

A production-grade JSON health check response should include a top-level status field ("pass", "fail", or "warn"), a version field, a releaseId for the current build, and a checks object reporting each dependency individually. The IETF draft-inadarei-api-health-check format is the de facto standard: {"status":"pass","version":"1.4.2","releaseId":"abc123","checks":{"postgres:connection":[{"status":"pass","observedValue":12,"observedUnit":"ms"}]}}. The HTTP status code should be 200 for "pass" and "warn", and 503 for "fail". The Content-Type header should be application/health+json. Including per-dependency status and response times allows monitoring tools to distinguish which specific component is failing, and latency trends in observedValue serve as early warning of impending failures. Never settle for a plain {"status": "ok"} in production services: it provides no operational value when something goes wrong.

What is the IETF health check draft format (pass/fail/warn)?

The IETF draft "Health Check Response Format for HTTP APIs" (draft-inadarei-api-health-check) defines a standard JSON schema for health endpoints. It specifies three status values with precise semantics: "pass" means fully operational (HTTP 200), "fail" means unable to serve requests (HTTP 503), and "warn" means operational but degraded (HTTP 200). The top-level response fields include status (required), version, releaseId, description, notes, output, and checks. Each key in checks follows the convention componentName:measurementName (e.g., postgres:responseTime) and maps to an array of check result objects. Each check result can contain status, componentId, componentType, observedValue, observedUnit, and time. The draft is not an RFC, and popular health check libraries such as @godaddy/terminus, fastify-healthcheck, and Spring Boot Actuator have their own default response shapes (Actuator, for example, reports UP and DOWN rather than "pass" and "fail"), so check what your library emits and map it to this format where needed. Using this format makes your health endpoint parseable by monitoring tools and SRE tooling designed around the draft.
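The fields listed above can be written down as TypeScript types. This is a sketch of the subset named in this answer, not a complete rendering of the draft; the value types (for example number or string for observedValue) and the checkKey helper are assumptions.

```typescript
type Status = "pass" | "warn" | "fail";

interface CheckEntry {
  status: Status;
  componentId?: string;
  componentType?: string;
  observedValue?: number | string; // assumed types for illustration
  observedUnit?: string;
  time?: string; // ISO 8601 timestamp
}

interface HealthResponse {
  status: Status; // the only required field
  version?: string;
  releaseId?: string;
  description?: string;
  notes?: string[];
  output?: string;
  checks?: Record<string, CheckEntry[]>;
}

// Builds a checks key following the componentName:measurementName convention.
function checkKey(componentName: string, measurementName: string): string {
  return `${componentName}:${measurementName}`;
}

const exampleResponse: HealthResponse = {
  status: "pass",
  version: "1.4.2",
  releaseId: "abc123",
  checks: {
    [checkKey("postgres", "responseTime")]: [
      { status: "pass", componentType: "datastore", observedValue: 12, observedUnit: "ms" },
    ],
  },
};

console.log(JSON.stringify(exampleResponse));
```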

What is the difference between a liveness probe and a readiness probe in Kubernetes?

A liveness probe asks "Is this process alive?" and failure causes kubelet to kill and restart the container. A readiness probe asks "Can this pod serve traffic?" and failure causes Kubernetes to remove the pod from the Service load balancer without killing it. The critical rule: liveness probes must never check external dependencies. If your database goes down, restarting your application pod will not fix it, but will trigger a restart storm and potentially a CrashLoopBackOff. A liveness probe should only verify that the application event loop is responding: an endpoint that returns { "status": "pass" } in under 100ms with no external calls. The readiness probe is the correct place for dependency checks: if the database is down, the pod stops receiving traffic (readiness fails) but the process stays alive and will automatically rejoin the load balancer when the database recovers. A third probe type, the startup probe, prevents liveness and readiness from firing until the application has finished initializing, which is essential for apps that run migrations on startup.
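The split between the two probes can be sketched as a framework-free routing function. The /livez and /readyz paths and the handleProbe name are assumptions for this example; in a real service the criticalDependenciesUp flag would come from the dependency checks.

```typescript
interface ProbeResult {
  httpStatus: 200 | 404 | 503;
  body: { status: "pass" | "fail" } | null;
}

function handleProbe(path: string, criticalDependenciesUp: boolean): ProbeResult {
  switch (path) {
    case "/livez":
      // Liveness never looks at dependencies: if this code runs, the process is alive.
      return { httpStatus: 200, body: { status: "pass" } };
    case "/readyz":
      // Readiness reflects critical dependencies, so a database outage
      // removes the pod from rotation without restarting it.
      return criticalDependenciesUp
        ? { httpStatus: 200, body: { status: "pass" } }
        : { httpStatus: 503, body: { status: "fail" } };
    default:
      return { httpStatus: 404, body: null };
  }
}

// During a database outage, liveness still passes while readiness fails.
console.log(handleProbe("/livez", false).httpStatus); // 200
console.log(handleProbe("/readyz", false).httpStatus); // 503
```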

How do I check database and Redis connectivity in a health check endpoint?

Use lightweight ping operations and always run them in parallel with Promise.allSettled. For PostgreSQL: execute SELECT 1 with a 2-second timeout using your existing connection pool (or a dedicated single-connection health pool to avoid starving application queries). For Redis: call client.ping() which returns PONG in under 5ms on a healthy connection. For external HTTP APIs: send a HEAD request with an AbortController timeout of 3 seconds. Each check should record its own startTime before the operation and compute responseTime = Date.now() - startTime after. Wrap each check in a try/catch; if it throws, return { "status": "fail", "output": err.message }. Use Promise.allSettled([checkPostgres(), checkRedis(), checkStripe()]), not Promise.all, so a Redis timeout does not prevent the PostgreSQL result from being collected. Classify each dependency as critical (failure drives aggregate to "fail") or non-critical (failure drives aggregate to "warn") at configuration time.
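The timing and try/catch pattern described above might look like the following. The operation passed to runCheck is a stand-in for a real call such as a SELECT 1 or a Redis ping; runCheck and withTimeout are hypothetical helpers, and the timeout here only stops waiting, it does not cancel the underlying operation.

```typescript
interface CheckResult {
  status: "pass" | "fail";
  responseTime: number; // milliseconds
  output?: string;
}

// Rejects if the operation has not settled within timeoutMs.
function withTimeout<T>(operation: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
    operation.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Times one dependency check and converts any error into a "fail" result.
async function runCheck(operation: () => Promise<unknown>, timeoutMs: number): Promise<CheckResult> {
  const startTime = Date.now();
  try {
    await withTimeout(operation(), timeoutMs);
    return { status: "pass", responseTime: Date.now() - startTime };
  } catch (err) {
    const output = err instanceof Error ? err.message : String(err);
    return { status: "fail", responseTime: Date.now() - startTime, output };
  }
}

// Stand-in for a real ping; a production check would call the client here.
runCheck(async () => "PONG", 2000).then((result) => console.log(result.status)); // pass
```

Because runCheck never rejects, its results can be passed to Promise.allSettled or Promise.all alike; allSettled remains the safer default if other checks can throw.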

How do I avoid slow health checks impacting my API performance?

Four strategies eliminate health check performance issues. First, use only lightweight ping operations: SELECT 1 for databases, PING for Redis, HEAD for HTTP APIs — never run application business logic queries from a health check. Second, run all dependency checks in parallel with Promise.allSettled: three checks at 50ms each take 50ms total in parallel vs. 150ms serially. Third, cache the health check result for 5 seconds. Kubernetes probes, Prometheus scrapes, and monitoring tools can collectively generate hundreds of health check calls per minute; a 5-second in-memory cache means at most one actual dependency check round per 5 seconds. Fourth, use a dedicated database connection pool with a maximum of one connection for health checks, preventing health probes from competing with application queries for pool slots. Set Kubernetes probe timeoutSeconds to 1 for liveness (it should always respond in well under 100ms) and 3–5 for readiness. If the readiness endpoint exceeds 500ms regularly, the dependency checks themselves are the problem — tighten the per-check timeout and investigate the slow dependency.
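The 5-second cache from the third strategy can be sketched as a small wrapper. The cached helper and its injectable clock are assumptions for illustration; it caches the in-flight Promise, so concurrent probe calls share one round of dependency checks.

```typescript
// Wraps an async function so it runs at most once per ttlMs window.
function cached<T>(
  compute: () => Promise<T>,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<T> {
  let current: Promise<T> | undefined;
  let expiresAt = 0;
  return () => {
    if (current === undefined || now() >= expiresAt) {
      current = compute();
      expiresAt = now() + ttlMs;
    }
    return current;
  };
}

// Stand-in for the real dependency check round (assumed for this example).
let rounds = 0;
const getHealth = cached(async () => {
  rounds += 1;
  return { status: "pass" as const };
}, 5000);

Promise.all([getHealth(), getHealth(), getHealth()]).then(() => {
  console.log(rounds); // 1
});
```

This sketch assumes compute resolves with a result object rather than rejecting; a rejected Promise would also be cached until the window expires.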

What information should I hide from public-facing health check responses?

Public health endpoints (reachable by load balancers, external monitors, and Kubernetes probes) should expose only the aggregate status, version, and releaseId — never the full checks object. Specifically, never include: database connection strings or hostnames, internal IP addresses or service discovery names, port numbers for internal services, API keys or authentication tokens, stack traces or error messages that reveal library versions and code paths, file system paths, or environment variable names. Instead of exposing "output": "ECONNREFUSED db-prod-01.internal:5432/orders_db", expose only "output": "Primary database unreachable" or omit the output field entirely. Expose the full checks object with detailed error information only on an internal health endpoint (e.g., /health/detailed) protected by a Kubernetes NetworkPolicy combined with a shared secret request header. Treat the health endpoint as part of your API security surface area and include it in security reviews.
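A simple way to enforce this is to build the public response from an allow-list of fields rather than deleting sensitive ones. The type and function names below are hypothetical.

```typescript
type Status = "pass" | "warn" | "fail";

interface DetailedHealth {
  status: Status;
  version?: string;
  releaseId?: string;
  checks?: Record<string, Array<{ status: Status; output?: string }>>;
}

interface PublicHealth {
  status: Status;
  version?: string;
  releaseId?: string;
}

// Copies only the allow-listed fields; anything added to DetailedHealth
// later stays private unless it is explicitly added here too.
function toPublicHealth(detailed: DetailedHealth): PublicHealth {
  const { status, version, releaseId } = detailed;
  return { status, version, releaseId };
}

const detailedExample: DetailedHealth = {
  status: "fail",
  version: "1.4.2",
  releaseId: "abc123",
  checks: {
    "postgres:connection": [
      { status: "fail", output: "ECONNREFUSED db-prod-01.internal:5432/orders_db" },
    ],
  },
};

console.log(JSON.stringify(toPublicHealth(detailedExample)));
// {"status":"fail","version":"1.4.2","releaseId":"abc123"}
```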

How do I integrate JSON health checks with Prometheus or Datadog monitoring?

For Prometheus, expose a /metrics endpoint using the prom-client library and publish gauge metrics derived from your health check results: service_health_status{component="postgres:connection"} (1=pass, 0.5=warn, 0=fail) and service_health_response_time_ms{component="postgres:connection"}. Update these metrics on a 15-second interval independent of probe calls. Write alerting rules using for: 1m to avoid alerting on transient blips: fire ServiceHealthFail when status is 0 for 1 minute, and ServiceHealthSlow when response_time_ms > 200 for 2 minutes. For Datadog, use the http_check Agent integration to scrape /health and assert content_match: '"status":"pass"', or use DogStatsD to push metrics directly from the health check handler using dogstatsd.gauge('api.health.status', value, tags). For both platforms, track health check status as a time-series to compute SLA: the percentage of intervals where status is "pass" over a rolling 30-day window gives your service availability metric.
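The status-to-gauge mapping can be illustrated without any library by rendering the Prometheus text exposition format by hand. In practice prom-client would produce this output; renderMetrics and ComponentHealth are hypothetical names, and the sketch assumes component names contain no quotes, backslashes, or newlines that would need escaping.

```typescript
type Status = "pass" | "warn" | "fail";

// Gauge encoding described above: 1 = pass, 0.5 = warn, 0 = fail.
const GAUGE_VALUE: Record<Status, number> = { pass: 1, warn: 0.5, fail: 0 };

interface ComponentHealth {
  component: string;
  status: Status;
  responseTimeMs: number;
}

function renderMetrics(components: ComponentHealth[]): string {
  const lines: string[] = ["# TYPE service_health_status gauge"];
  for (const c of components) {
    lines.push(`service_health_status{component="${c.component}"} ${GAUGE_VALUE[c.status]}`);
  }
  lines.push("# TYPE service_health_response_time_ms gauge");
  for (const c of components) {
    lines.push(`service_health_response_time_ms{component="${c.component}"} ${c.responseTimeMs}`);
  }
  return lines.join("\n") + "\n";
}

console.log(
  renderMetrics([
    { component: "postgres:connection", status: "pass", responseTimeMs: 12 },
    { component: "redis:connection", status: "warn", responseTimeMs: 240 },
  ]),
);
```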

Further reading and primary sources