Platform Runbooks — JustineLonglaT-Lane Docs

Why this page exists

Architecture explains how the platform is designed. Observability shows how it behaves. Runbooks explain how to respond when that behavior needs to be verified or corrected, or when the platform needs to be recovered.

In the broader Engineering Mesh, runbooks sit after observability and before delivery. They are the layer that turns system signals into operational confidence.

Runbook catalog

These playbooks support repeatable operations across the JLT-Lane sandbox suite, observability stack, and local recovery workflows.

Sandbox Startup

Start the Node.js service, Prometheus, Grafana, and supporting containers in the local sandbox environment, then verify the stack is healthy.

Docker Compose startup
Container verification
Metrics endpoint validation
Prometheus target checks

Open runbook →
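As a rough illustration of the "verify the stack is healthy" step, the sketch below probes one HTTP endpoint per component. The host ports and the `/metrics` path on the Node.js service are assumptions for illustration, not values taken from the sandbox; `/-/healthy` (Prometheus) and `/api/health` (Grafana) are the standard health endpoints of those tools.

```python
from http.client import HTTPException
from typing import Callable, Dict
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical local endpoints; adjust them to match your compose file.
ENDPOINTS: Dict[str, str] = {
    "app_metrics": "http://localhost:3000/metrics",   # assumed Node.js service port
    "prometheus": "http://localhost:9090/-/healthy",  # Prometheus health endpoint
    "grafana": "http://localhost:3001/api/health",    # assumed host port for Grafana
}


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError, HTTPException):
        # Connection refused, timeouts, and 4xx/5xx responses all count as unhealthy.
        return False


def check_stack(
    endpoints: Dict[str, str],
    probe: Callable[[str], bool] = http_ok,
) -> Dict[str, bool]:
    """Probe every endpoint and report its health by component name."""
    return {name: probe(url) for name, url in endpoints.items()}
```

Calling `check_stack(ENDPOINTS)` after the containers are up returns a dictionary such as `{"app_metrics": True, ...}`. The `probe` parameter exists so the check can be exercised without a running stack.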

Prometheus Target Debug

Diagnose missing metrics, scrape failures, or Prometheus targets that are marked down when dashboards show incomplete or empty data.

Check /metrics
Inspect Prometheus targets
Validate Docker networking
Confirm scrape configuration

Open runbook →
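One way to "inspect Prometheus targets" without opening the UI is to read the JSON returned by the Prometheus HTTP API at `/api/v1/targets` and keep only the targets that are not healthy. The sketch below assumes the payload has already been fetched and decoded; the sample payload, job name, and error text are illustrative, not output from the sandbox.

```python
from typing import Any, Dict, List


def down_targets(payload: Dict[str, Any]) -> List[Dict[str, str]]:
    """List active targets whose health is anything other than "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    problems = []
    for target in active:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append({
                "job": labels.get("job", ""),
                "instance": labels.get("instance", ""),
                "health": target.get("health", "unknown"),
                "last_error": target.get("lastError", ""),
            })
    return problems


# Illustrative payload in the shape returned by GET /api/v1/targets.
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "health": "up",
                "lastError": "",
            },
            {
                "labels": {"job": "node-app", "instance": "app:3000"},  # hypothetical job
                "health": "down",
                "lastError": "connection refused",
            },
        ],
        "droppedTargets": [],
    },
}

for problem in down_targets(SAMPLE_PAYLOAD):
    print(f"{problem['job']} ({problem['instance']}): {problem['last_error']}")
```

The `lastError` text usually points at the next check: a connection error suggests Docker networking or port exposure, while an HTTP status error suggests the scrape path in the scrape configuration.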

Grafana Dashboard Setup

Configure Grafana to use Prometheus as a datasource and build dashboard panels that show CPU usage, memory usage, and request activity.

Datasource configuration
Panel creation
PromQL examples
Dashboard validation

Open runbook →
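To make the datasource and PromQL steps concrete, the sketch below builds the JSON body accepted by Grafana's datasource HTTP API (`POST /api/datasources`) and lists one candidate query per panel. The URL assumes a compose service named `prometheus`, and the metric names are assumptions: the two `process_*` metrics are commonly exposed by Prometheus client libraries, while `http_requests_total` is a hypothetical application counter.

```python
import json
from typing import Dict


def prometheus_datasource(url: str = "http://prometheus:9090") -> Dict[str, object]:
    """Build a datasource definition that points Grafana at Prometheus."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": url,          # assumed compose service name and port
        "access": "proxy",   # Grafana's backend performs the queries
        "isDefault": True,
    }


# Candidate PromQL per panel; metric names are assumptions, not sandbox facts.
PANEL_QUERIES: Dict[str, str] = {
    "CPU": "rate(process_cpu_seconds_total[5m])",
    "Memory": "process_resident_memory_bytes",
    "Requests": "sum(rate(http_requests_total[5m]))",  # hypothetical counter
}

print(json.dumps(prometheus_datasource(), indent=2))
```

A panel that shows "No data" with one of these queries is often a sign that the metric name differs in your service, which can be confirmed by calling `/metrics` directly.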

Docker Recovery

Recover the platform when containers fail to start, networks drift, or the local Docker environment becomes unstable.

Restart Docker Desktop
Reset the Compose stack
Inspect failing containers
Prune stale resources safely

Open runbook →
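The recovery steps can be sketched as an ordered list of Docker CLI commands that is printed first and executed only on request. This is a minimal sketch, not the runbook's actual procedure: the service name is a placeholder, restarting Docker Desktop is left as a manual step, and pruning is opt-in because `docker container prune` and `docker network prune` affect every project on the host, not only this stack.

```python
import subprocess
from typing import Callable, List

Command = List[str]


def recovery_plan(service: str, prune: bool = False) -> List[Command]:
    """Return the commands for an inspect, reset, and restart cycle."""
    plan: List[Command] = [
        ["docker", "compose", "ps", "--all"],                    # inspect container state
        ["docker", "compose", "logs", "--tail", "100", service],  # inspect the failing service
        ["docker", "compose", "down", "--remove-orphans"],        # stop and remove the stack
    ]
    if prune:
        # Host-wide cleanup of stopped containers and unused networks.
        plan.append(["docker", "container", "prune", "--force"])
        plan.append(["docker", "network", "prune", "--force"])
    plan.append(["docker", "compose", "up", "--detach"])          # start the stack again
    return plan


def run_plan(
    plan: List[Command],
    runner: Callable[..., object] = subprocess.run,
    dry_run: bool = True,
) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    shown = []
    for command in plan:
        line = " ".join(command)
        print("$ " + line)
        shown.append(line)
        if not dry_run:
            runner(command, check=True)  # stop at the first failing command
    return shown
```

Running `run_plan(recovery_plan("app"))` only prints the commands, which keeps the path to recovery visible before anything is changed; pass `dry_run=False` from the directory that holds the compose file to execute them.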

Metrics Endpoint Debug

Debug the /metrics endpoint when Prometheus cannot scrape data, or when the application is running but not exposing the expected telemetry.

Call /metrics directly
Verify service port exposure
Review application logs
Confirm metric registration

Open runbook →
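To "confirm metric registration", it can help to compare the names present in the `/metrics` response with the names you expect. The sketch below parses the Prometheus text exposition format in a deliberately simple way: it skips comment lines and takes the sample name up to the first `{` or space. The sample body and expected metric names are illustrative assumptions.

```python
from typing import Iterable, List, Set


def metric_names(exposition_text: str) -> Set[str]:
    """Collect sample names from a Prometheus text-format response body."""
    names: Set[str] = set()
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected names that do not appear in the response."""
    return sorted(set(expected) - metric_names(exposition_text))


# Illustrative response body; http_requests_total is a hypothetical counter.
SAMPLE_BODY = """\
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.42
# HELP http_requests_total Hypothetical request counter.
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/"} 7
"""

print(missing_metrics(
    SAMPLE_BODY,
    ["process_cpu_seconds_total", "process_resident_memory_bytes"],
))
```

Note that histograms and summaries appear under derived sample names (for example with `_bucket`, `_sum`, and `_count` suffixes), so expected names for those metric types need the suffix.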

Operating model

These runbooks follow a common troubleshooting pattern designed to keep recovery explainable, safe, and repeatable across environments.

Observe symptom
        ↓
Verify service/container state
        ↓
Check metrics or targets
        ↓
Confirm configuration
        ↓
Restart / recover safely
        ↓
Validate platform behavior

The goal is not only to fix issues, but to make the path to recovery visible and teachable.
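The verification part of this pattern can be sketched as an ordered list of named checks that stops at the first failure, so the failing step tells you which runbook to open. The step names mirror the flow above; the check functions here are placeholders that you would replace with real probes.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def first_failing_step(steps: List[Step]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for name, check in steps:
        passed = check()
        print(f"{'PASS' if passed else 'FAIL'}  {name}")
        if not passed:
            return name  # later checks are skipped; fix this layer first
    return None


# Placeholder checks for illustration only.
EXAMPLE_STEPS: List[Step] = [
    ("Verify service/container state", lambda: True),
    ("Check metrics or targets", lambda: False),
    ("Confirm configuration", lambda: True),
]

failing = first_failing_step(EXAMPLE_STEPS)
```

Stopping at the first failure keeps the investigation in the same order as the flow: there is little value in reviewing scrape configuration while the container itself is not running.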

Where runbooks sit in the Engineering Mesh

Runbooks are the operational bridge between observability and delivery.

Architecture
        ↓
Sandbox
        ↓
Observability
        ↓
Runbooks
        ↓
Delivery

The sandbox creates a safe place to generate and observe platform signals. Runbooks turn those signals into repeatable operational action.

From runbooks to action

These related pages connect operational procedures to the rest of the platform.

Follow the signals →

Return to the observability architecture to see where metrics, targets, and dashboards originate.

Return to architecture →

Step back into the broader platform map: Engineering Mesh, sandbox, and reliability flow.

Use the toolkit →

Explore supporting scripts, execution patterns, and automation references that reinforce repeatable operations.

Return to MeshHub →

Go back to the documentation control plane and choose another path through the platform.

Read the blog ↗

Follow longer reflections on reliability, platform engineering, and operational discipline.

Planned expansions

Observability stack reset
Container health verification
Incident simulation: Grafana “No Data”
Prometheus scrape failure recovery

The long-term goal is to make this page the operational entry point for troubleshooting and platform recovery across the JLT-Lane ecosystem.