1. Box is stuck “offline” after install
What you see
- Edge tab → List → box row shows `status: offline` and a red ring on the topology graph
- `last_seen_at` is “Never” or stuck on a stale timestamp
- On the host: is the systemd unit actually up?
  If the `edge` container is not Up, run `docker compose logs edge` and look for crashes.
- Cosign signature check. If you see lines like `cosign verify failed`, the agent is refusing to pull a new image because it can’t validate provenance. Two paths:
  - Expected (dev/POC), local-built image: set `OXY_DISABLE_SIGNATURE_VERIFY=true` in `/opt/oxy-edge/.env`, restart the service.
  - Unexpected (production): the workflow identity may have changed. See edge release notes.
- Network reachability. From the host:
  You should get a 200. If it hangs or returns SSL errors, the host can’t reach Oxy; fix the firewall/DNS/cert path first.
- Auth. If you see `bearer rejected` or `jwt rejected` in the worker logs:
  - The device identity may be stale (e.g. you re-added the box without removing the old one). Run Remove device in the UI, then Add device again; the install command will include a fresh secret.
- Network / cert issue → fix the upstream, restart the service.
- Identity issue → re-add the device from the UI.
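The log-based checks above can be sketched as a small shell helper. This is a rough sketch under stated assumptions, not part of the product: `triage_edge_logs` and `check_reachable` are hypothetical names, and the URL passed to `check_reachable` is a placeholder you supply yourself (the Oxy endpoint is not given here). Only the three log strings come from the steps above.

```shell
# Hypothetical helpers; the function names are illustrative, not part of Oxy.

# Read worker/agent logs on stdin and print which diagnosis path they point to.
triage_edge_logs() {
  _logs="$(cat)"
  if printf '%s\n' "$_logs" | grep -q 'cosign verify failed'; then
    echo "signature"
  elif printf '%s\n' "$_logs" | grep -qE 'bearer rejected|jwt rejected'; then
    echo "auth"
  else
    echo "no-known-error"
  fi
}

# Print the HTTP status code returned by the given URL (placeholder argument).
# Gives up after 10 seconds so a hanging connection is visible as a failure.
check_reachable() {
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "$1"
}
```

A possible use on the host is `docker compose logs edge --tail=500 | triage_edge_logs`, which prints `signature`, `auth`, or `no-known-error`.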
2. Camera shows no preview
What you see
- Edge → List → Cameras row exists but the snapshot thumbnail is broken
- Timeline tab shows the camera in the dropdown but events are empty
- Is the worker reading the camera at all? Worker logs (`docker compose logs edge | grep <camera_id>`) should show a per-camera decode loop. If silent, the camera isn’t in the worker’s config; check that the camera’s `edge_box_id` is set in the UI.
- RTSP URL is reachable from the host. Run `ffprobe <rtsp_url>` on the edge host; if it can’t open the stream, neither can the worker. Check camera IP + port + credentials.
- Credentials. If the URL needs auth, make sure the `credentials_ref` on the camera points at a workspace secret that exists and has the right username/password.
- Wrong site/edge_box binding → edit the camera row in the UI.
- Bad RTSP creds → update the workspace secret; worker picks it up within 30s.
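The first two camera checks can be wrapped in shell helpers like the following. This is a sketch only: `camera_log_lines` and `rtsp_reachable` are hypothetical names, and the 15-second limit is an arbitrary choice rather than a documented value.

```shell
# Hypothetical helpers for the camera checks; names are illustrative.

# Count the log lines on stdin that mention the given camera id.
# Prints 0 when the camera never appears (the "silent" case).
camera_log_lines() {
  grep -cF -- "$1" || true
}

# Succeeds only if ffprobe can open the stream within 15 seconds.
# ffprobe prints its error to stderr when the stream cannot be opened.
rtsp_reachable() {
  timeout 15 ffprobe -v error "$1"
}
```

For example, `docker compose logs edge | camera_log_lines <camera_id>` printing `0` suggests the camera is not in the worker's config. The match is a plain substring match, so an id that is a prefix of another id is counted for both.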
3. Rollout was auto-aborted
What you see
- Edge → Rollouts → row shows `Aborted` with a reason like `canary failure rate 25.0% exceeded threshold 5.0%`
- Click into the rollout detail page. The summary chip strip tells you how many boxes converged vs failed.
- Scan the per-box table for rows tinted red; these are the failures. The `Last result` column shows one of:
  - `reverted`: apply ran, the new image didn’t catch up, the agent successfully reverted to the previous image. Box is on the old image now.
  - `revert_failed`: apply ran, the new image didn’t catch up, AND the rollback also failed. Box is in an unknown state and needs investigation.
  - `stuck_at_broken_target`: apply ran but there was no prior image to roll back to (first-ever apply). Box is on the broken image.
- For `reverted` boxes: the old image still works. Investigate why the new tag broke (`docker compose logs edge` on one of the failed hosts) before retrying the rollout.
- For `revert_failed` or `stuck_at_broken_target`: SSH to the host and manually `docker pull <known-good-tag>` + restart. Then fix the rollout target before retrying.
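To make the abort reason and the per-result follow-ups concrete, here is a small shell sketch. Both function names are hypothetical, and the rate formula (failed boxes divided by canary boxes) is an assumption about how a figure like 25.0% could arise, not a statement of how the server computes it.

```shell
# Failure rate as a percentage with one decimal place.
# Assumes total > 0; the formula itself is an assumption for illustration.
canary_failure_rate() {
  awk -v f="$1" -v t="$2" 'BEGIN { printf "%.1f\n", (f * 100) / t }'
}

# Map a Last result value to the follow-up described in the list above.
next_step_for_result() {
  case "$1" in
    reverted) echo "investigate-logs-then-retry" ;;
    revert_failed|stuck_at_broken_target) echo "manual-pull-and-restart" ;;
    *) echo "unknown-result" ;;
  esac
}
```

Under that assumption, one failed box in a four-box canary gives `canary_failure_rate 1 4`, which prints `25.0` and is well above a 5.0% threshold.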
4. Slack alerts are not firing
What you see
- Cameras are degraded/stale in the UI, but Slack is silent
- Slack install exists for the org. Settings → Slack — should show a connected workspace.
- Default channel is set for this workspace. Same Slack settings page — the per-workspace default channel must be configured. Without it, alerts are silently dropped (this is documented behavior — picking an arbitrary channel for the operator would be worse).
- Test the wiring. Edge → List → Edge boxes → Test Slack button. If the test message lands in Slack, the wiring works and the silence is about transitions specifically.
- Cooldown. The same camera won’t generate a second alert within 30 minutes of the previous one. If you’re testing by repeatedly toggling, you’ll only see the first transition.
- Startup grace. The first 5 minutes after an Oxy server restart suppress alerts (otherwise every camera would alert because we lost the previous tick’s state).
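The two suppression rules (cooldown and startup grace) can be modelled in a few lines of shell, which helps when reasoning about why a test toggle produced no alert. This is a model of the rules as described above, not the server's implementation; `alert_decision` is a hypothetical name, and the order in which the two rules are checked is an assumption.

```shell
COOLDOWN_SECS=1800       # 30 minutes, from the cooldown rule
STARTUP_GRACE_SECS=300   # 5 minutes, from the startup grace rule

# Args (epoch seconds): now, last alert time for this camera (empty if none),
# server start time. Prints "send" or the reason the alert is suppressed.
alert_decision() {
  _now="$1"; _last="$2"; _start="$3"
  if [ $((_now - _start)) -lt "$STARTUP_GRACE_SECS" ]; then
    echo "suppressed:startup-grace"
  elif [ -n "$_last" ] && [ $((_now - _last)) -lt "$COOLDOWN_SECS" ]; then
    echo "suppressed:cooldown"
  else
    echo "send"
  fi
}
```

For instance, `alert_decision 10000 9000 0` prints `suppressed:cooldown` because only 1000 seconds have passed since the previous alert for that camera.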
5. Cohort change isn’t reflected in the next rollout
What you see
- You moved box X out of cohort `staging` and into `prod`
- The New Rollout wizard’s canary cohort picker still shows X as part of `staging`
- The picker reads cohorts from the EdgeBoxes list. Refresh the page — the React Query cache invalidates on cohort edit but a stale tab won’t pick it up.
- If still wrong after refresh, check the EdgeBoxes table — the cohort column should reflect the change.
- Refresh the page.
6. Box on legacy bearer auth
What you see
- Edge tab banner: “N of M boxes still on legacy bearer auth”
- One or more EdgeBoxes rows show an amber `bearer` chip
- The cleanest path is to re-onboard each bearer-mode box: Remove device then Add device. The new install command uses JWT auth from first boot.
- For boxes you can’t reach to re-onboard: an OTA update to the latest worker image will pick up the JWT-minter automatically. Set the target tag and watch the cohort; the next `/control/*` call after restart will be JWT-signed.
Once every box shows `jwt`, the cluster operator can flip `OXY_REQUIRE_JWT_AUTH=true` on the server and the bearer code path is hard-disabled. See JWT cutover runbook.
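Before flipping the flag it is worth confirming that no box is still on bearer auth. The sketch below assumes you have a plain-text list with one `<device_id> <auth_mode>` pair per line; that format and the `bearer_boxes` name are assumptions made for illustration, not an Oxy export or command.

```shell
# Input (stdin): one "<device_id> <auth_mode>" pair per line.
# This input format is hypothetical; adapt it to whatever list you have.
# Prints the device ids that are still on bearer auth.
bearer_boxes() {
  awk '$2 == "bearer" { print $1 }'
}
```

If the function prints nothing, every listed box is already on `jwt`; any printed id still needs re-onboarding or an OTA update first.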
When you’re really stuck
- Pull the worker logs (`docker compose logs edge --tail=500`) and the update agent logs (`docker compose logs update-agent --tail=200`) and open an issue with both attached.
- Note the box’s `device_id` and the workspace ID; both are visible in the EdgeBoxes table.
- If the issue is signature verification, include the agent’s `cosign verify` error line verbatim.
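The log collection above can be scripted so both files end up in one directory ready to attach. This is a minimal sketch: `collect_edge_report` and the output file names are hypothetical, and it assumes you run it from the directory that holds the compose file.

```shell
# Collect the logs requested above into a directory (default: ./oxy-edge-report).
# Run from the directory that contains the compose file.
collect_edge_report() {
  _out="${1:-./oxy-edge-report}"
  mkdir -p "$_out" || return 1
  docker compose logs edge --tail=500 > "$_out/edge.log" 2>&1
  docker compose logs update-agent --tail=200 > "$_out/update-agent.log" 2>&1
  # Keep any signature-check lines verbatim so they can be pasted into the issue.
  grep -h 'cosign verify' "$_out/edge.log" "$_out/update-agent.log" \
    > "$_out/cosign-errors.txt" || true
  echo "$_out"
}
```

Calling `collect_edge_report /tmp/report` prints the directory path when done. The `device_id` and workspace ID still have to be copied from the EdgeBoxes table by hand.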