Metric Tree & Anomaly Monitoring, Admin Operator Cockpit, Standalone Worker Fleet, and RDS IAM Authentication

New Features

Metric Tree and Anomaly Monitoring

A new semantic-layer observability stack built around the metric tree: explore how a top-line metric decomposes into its drivers, watch any measure for anomalies over time, and let the analytics agent run root-cause analysis — all from a dedicated Semantic Layer tab in the Developer Portal.

Metric Tree Explorer — A new Semantic Layer tab in the IDE consolidates the semantic explorer and metric tree behind one sidebar entry. Pick a measure and time dimension and explore the tree: see how a metric breaks down into its driver metrics, run sensitivity analysis to find which inputs move the number most, ask what-if questions, and surface the biggest opportunities to investigate.
Anomaly Detection — Define monitors in a .monitor.yml file to watch any measure over time. The detector combines seasonal-trend decomposition (MSTL) with automatic forecasting (AutoETS) and supports per-segment detection by fanning out across filters and group_by. A direction filter (increase, decrease, or both) suppresses false positives, and each scan window excludes the current incomplete period so a partial day or week never reads as a drop.
Insights Inbox — Detected anomalies land in an inbox surface where each one opens an explain drawer with a root-cause breakdown and a follow-up button to take the investigation into a full analytics conversation.
AI Root-Cause Analysis — The analytics agent gains anomaly tools — it can list outstanding anomalies, run detection on demand, and explain a specific anomaly — woven into its existing root-cause reasoning loop, so you can ask “what’s anomalous in revenue this week, and why?” directly in chat.
Scheduled Monitor Scans — Monitor scans run on a first-class cron schedule rather than a fixed hourly tick. A schedule: block in .monitor.yml sets per-granularity cron expressions (daily, weekly, monthly), and each schedule appears in the Schedules settings panel with an amber Monitor job badge and a read-only granularity chip.

# .monitor.yml
schedule:
  daily: "0 6 * * *"
  weekly: "0 6 * * 1"

monitors:
  - measure: orders.revenue
    time_dimension: orders.completed_at
    granularity: day
    lookback_days: 90
    direction: decrease

Scheduled scans fire only when the in-process global worker is enabled (OXY_INPROC_GLOBAL_WORKER=1).

AWS RDS IAM Authentication for Workers

Oxy worker processes can now authenticate to their PostgreSQL application database using AWS RDS IAM credentials instead of a static password — a more secure option for cloud deployments where long-lived database passwords are undesirable.

Two Authentication Modes — A new OXY_DATABASE_AUTH_MODE environment variable selects between password (the default, using OXY_DATABASE_URL) and iam. IAM mode builds the connection from discrete OXY_DATABASE_HOST, OXY_DATABASE_NAME, OXY_DATABASE_USER, and OXY_DATABASE_REGION variables and uses temporary IAM credentials at connect time — no database password to store or rotate.
Fail-Fast Validation — A misconfigured IAM setup is caught before the first connection attempt, with a clear error naming the specific missing variable instead of an opaque connection failure later.
Safe Credential Logging — Connection details are sanitized in logs for both modes: passwords are stripped in password mode, and IAM mode logs only the host and database, so credentials never leak into worker logs.

Optional OXY_DATABASE_PORT and OXY_DATABASE_SSL_MODE settings default to RDS standards (5432 and require).
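As a rough sketch of what an IAM-mode worker environment might look like, the snippet below sets the variables named above and runs a pre-flight check for the required ones. The host, database, user, and region values are placeholders, and check_iam_env is a hypothetical operator-side helper, not part of Oxy; Oxy performs its own validation at startup.

```shell
# Placeholder values; substitute your own RDS endpoint, database, user, and region.
export OXY_DATABASE_AUTH_MODE=iam
export OXY_DATABASE_HOST=mydb.example.us-east-1.rds.amazonaws.com
export OXY_DATABASE_NAME=oxy
export OXY_DATABASE_USER=oxy_worker
export OXY_DATABASE_REGION=us-east-1
# Optional; these match the documented defaults.
export OXY_DATABASE_PORT=5432
export OXY_DATABASE_SSL_MODE=require

# Hypothetical pre-flight helper: prints each unset required variable
# and returns non-zero if any is missing.
check_iam_env() {
  missing=0
  for name in OXY_DATABASE_HOST OXY_DATABASE_NAME OXY_DATABASE_USER OXY_DATABASE_REGION; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

check_iam_env && echo "IAM environment looks complete"
```

Running a check like this in a deploy script catches a missing variable before the worker is started at all, which complements the fail-fast validation Oxy does itself.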

Tunable ClickHouse Sources for Airway ELT Pipelines

Airway ELT pipelines reading from ClickHouse now expose the timeout and batching knobs needed to keep long streaming extracts from timing out against slow destinations — no more SOCKET_TIMEOUT aborts surfacing as opaque “error decoding response body” failures mid-run.

Configurable Timeouts — New connect_timeout_secs and read_timeout_secs options on a clickhouse source replace the previously hardcoded client timeouts, so extracts that back-pressure on a slow destination no longer get cut off prematurely.
Pass-Through ClickHouse Settings — A settings map forwards arbitrary ClickHouse settings (such as http_send_timeout, send_timeout, or max_execution_time) as query parameters, giving pipeline authors server-side control over socket timeouts without touching the warehouse config.
Adjustable Batch Size — A batch_size option controls how many rows are written to the destination per commit (previously a fixed 10k). Smaller batches drain faster and keep each read pause well under the timeout window, an easy client-side lever when a destination is the bottleneck.

source:
  kind: clickhouse
  host: <host>
  database: <db>
  read_timeout_secs: 1200
  batch_size: 1000
  settings:
    http_send_timeout: 1200
    send_timeout: 1200

All new options default to the prior behavior, so existing pipelines are unaffected.

Admin Operator Cockpit

The admin surface is rebuilt into a denser, signal-first cockpit for platform operators, and adds new cross-tenant capabilities for understanding spend, managing tenants, and debugging customer issues.

Internal Jobs Ops Console — The internal jobs view becomes an operations console: charts shrink to a compact health ribbon while a dense, expandable jobs table takes center stage. Each failed or dead job drills into a debug panel showing the full error, the decoded task spec, and the exact workspace, organization, and user it belongs to — with a free-text filter across every field. Dead-letter jobs can be re-enqueued or deleted directly from the table.
Human-Readable Worker Identity — Workers now appear as {env}·{host-or-pod}·{short} instead of an opaque worker-<uuid>, so a job can be traced back to the infrastructure that ran it at a glance.
Tenants Management Console — A unified /admin/tenants hub replaces three siloed slide-out sheets with stat cards, a “needs attention” triage feed (stale orgs, orgless users, failed or orphaned workspaces), and recently-created rosters. Full-page detail views for orgs, users, and workspaces are cross-linked, so operators can traverse the User ↔ Org ↔ Workspace relationship without backing out to a list and re-searching at every hop.
LLM Cost Dashboard — A new spend view on the tenants page sums token usage and computes per-model dollar cost over a 7-, 30-, or 90-day window, surfacing a total, a daily trend, a by-model breakdown, and the top accounts by spend.
Cross-Tenant Explorer — A new explorer page — and a Threads group in the ⌘K palette — lets operators search threads and agentic runs across every tenant, enriched with workspace, org, and user, to debug a customer issue without leaving the admin surface.
Universal Admin Search — A ⌘K command palette in the admin topbar searches orgs, users, and workspaces simultaneously, grouped by type, for instant navigation to any tenant entity.
Workspace Navigation Fix — A global operator opening a workspace from an org they don’t belong to is no longer bounced back home; navigation now lands directly on the workspace.

Standalone Worker Fleet

Oxy’s background task processing can now run as its own horizontally-scalable fleet, separate from the API server, as the first phase of multi-instance scaling.

oxy worker Subcommand — Runs the durable task orchestrator as a standalone process with no HTTP server, suitable for independent scaling and rolling deploys. It honors OXY_WORKER_MAX_INFLIGHT, OXY_WORKER_RECOVERY_INTERVAL_SECS, and OXY_WORKER_HEALTH_PORT, shuts down gracefully on SIGTERM/SIGINT with a 30-second drain, and tags every log line and trace span with a stable worker identity.
HTTP-Only Mode — oxy serve --no-workers (or OXY_DISABLE_INPROCESS_WORKERS=1) runs the API server without in-process workers, for deployments that run a separate worker fleet. This is opt-in — the default single-process deployment is unchanged.
Kubernetes Health Probes — oxy worker --health-port <port> exposes /healthz and /readyz on a dedicated lightweight router for liveness and readiness checks.

# HTTP fleet
oxy serve --no-workers --health-port 8080
# Worker fleet (scales independently on queue depth)
oxy worker --health-port 8081
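
To sanity-check a worker started with the command above, the probe endpoints can be exercised by hand. This is a minimal sketch that assumes the worker is reachable on localhost at port 8081 and that each endpoint answers with a non-error HTTP status when healthy, as Kubernetes probes conventionally expect; probe_worker is a hypothetical helper name.

```shell
HEALTH_PORT=8081   # matches the worker example above

# Hypothetical helper: succeeds only if both probe endpoints respond
# without an HTTP error (curl -f turns HTTP >= 400 into a non-zero exit).
probe_worker() {
  curl -fsS "http://localhost:${HEALTH_PORT}/healthz" >/dev/null &&
  curl -fsS "http://localhost:${HEALTH_PORT}/readyz" >/dev/null
}

# Usage, once a worker is running:
#   probe_worker && echo "worker is live and ready"
```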

Inline Environment Variables in Airway Pipelines

oxy airway run now substitutes the host process’s environment variables — including a .env file loaded from the working directory — into .airway.yml templates at load time, so a pipeline can reference a secret inline with {{ MY_API_KEY }} instead of routing every credential through the secret manager.

Inline {{ VAR }} Substitution — Any environment variable is available in the minijinja template context when the YAML loads, so source configuration (API tokens, hosts, database names) can be templated directly in the pipeline file.

source:
  kind: rest_api
  config:
    auth:
      type: bearer
      token: "{{ YELP_API_KEY }}"   # substituted from the process environment at load
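
For instance, the variable referenced in the template above could be supplied either by exporting it in the shell that launches oxy airway run or by placing it in a .env file in the working directory. The key value below is a placeholder.

```shell
# Option 1: export the variable in the shell that launches the pipeline.
export YELP_API_KEY="replace-with-your-key"

# Option 2: put the same assignment in a .env file in the working directory:
#   YELP_API_KEY=replace-with-your-key
```

Because substitution happens when the YAML loads, the variable has to be set before the run starts.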

Platform Improvements

Procedures

Workspace-relative SQL file resolution — The analytics agent now correctly resolves SQL procedure files referenced by paths relative to the workspace root, so procedures that point at SQL files by their workspace path load and execute reliably regardless of where the workspace lives on disk.

API and Documentation

Richer API reference — The OpenAPI specification served by the API now documents both API Key and Bearer (JWT) authentication — how to obtain each credential, which header to use, and how to authorize directly from the Swagger UI. A new CLI quick-start covers the oxy login and oxy api commands, including the authentication flow, common request patterns, and target-environment configuration, with a link out to the full documentation at docs.oxy.tech.

Security and Authentication

API key revocation and expiration now enforced — Validating an API key now rejects revoked (inactive) and expired keys at the authentication layer. Revoking a key takes effect immediately, and keys with an expiration date stop working once they pass it — closing a gap where a revoked or expired key could still authenticate.
Global Owner vs. Global Admin clarified — The platform-wide admin role is now split into two distinct levels: Global Owner (full control, including billing and managing other Global Admins) and Global Admin (operational access to admin tooling, including the new Internal Jobs and Tenants consoles). Billing and global-admin-management routes now strictly require Global Owner. The previous OXY_APP_ADMINS environment variable is renamed to OXY_GLOBAL_ADMINS, with the old name still read as a deprecated fallback.
Closed a billing permission gap — A Global Admin’s automatically-granted organization membership could previously satisfy organization-admin checks, including on billing endpoints (Stripe portal, invoices). Billing and member-management handlers now reject this synthetic membership, so only real organization admins and Global Owners can access billing.
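
For the OXY_APP_ADMINS rename described above, a deployment still setting the old variable could migrate with something like the following sketch. It copies whatever value OXY_APP_ADMINS holds into OXY_GLOBAL_ADMINS without interpreting it; migrate_admins_var is a hypothetical helper name, and the value format is whatever the deployment already uses.

```shell
# Hypothetical helper: if only the deprecated name is set, carry its value
# over to the new name, then drop the old one.
migrate_admins_var() {
  if [ -z "${OXY_GLOBAL_ADMINS:-}" ] && [ -n "${OXY_APP_ADMINS:-}" ]; then
    export OXY_GLOBAL_ADMINS="$OXY_APP_ADMINS"
  fi
  unset OXY_APP_ADMINS
}

migrate_admins_var
```

An existing OXY_GLOBAL_ADMINS value is left untouched, so the helper is safe to run on environments that have already migrated.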

Logging

Quieter debug logs — Running with OXY_DEBUG=true no longer floods the output with raw SQL statements and TLS handshake chatter from underlying framework crates, so your own application logs stay readable. Setting RUST_LOG still overrides this for full verbosity when you need the raw firehose.
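
The two settings described above might be combined as in this sketch. The trace value assumes Oxy accepts the standard Rust log-filter levels in RUST_LOG; the exact directives it honors are not stated here.

```shell
# Quieter default: application debug logs without framework chatter.
export OXY_DEBUG=true

# To get the raw firehose back, additionally set a RUST_LOG filter.
# "trace" is an assumed example value (the most verbose standard level).
export RUST_LOG=trace
```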

Airway and Airhouse Pipelines

Credentials refreshed on every reconnect — Long-running Airhouse ELT loads no longer fail partway through when their short-lived credential expires. The pipeline mints a fresh credential on every reconnect, so a load that runs past the credential’s 15-minute TTL reconnects cleanly instead of failing with an “Airhouse pgwire connect error” mid-run.

Reliability

Telemetry no longer stalls when the Airhouse backend restarts — The observability backend’s long-lived connection to Airhouse now uses TCP keepalive, so when the Airhouse data plane is rolled or evicted a dead peer is detected in about a minute and the connection fails over promptly — instead of silently stalling span, intent, and metric writes for up to two hours.
Faster task wake-ups on TLS-secured databases — The background task router’s live notification connection now honors OXY_DATABASE_SSL_MODE the same way the main connection pool does. On deployments using managed Postgres with a private CA (such as AWS RDS or self-signed in-cluster databases), the router’s TLS handshake had been failing and silently falling back to slower polling, with a router.reconnect warning every five seconds. The listener now connects cleanly, restoring low-latency task wake-ups and ending the reconnect log noise. Operators who need strict certificate validation can still set OXY_DATABASE_SSL_MODE=verify-full.
Server startup no longer crash-loops without a project checkout — Cloud oxy serve deployments that run as API servers with no project checkout in their working directory could fail to start, crash-looping with a Failed to read config from file: No such file or directory error introduced in 0.5.70. The pre-aggregation worker now falls back to a default configuration when no config.yml is present at the startup path, so these deployments start cleanly again — with no pre-aggregations defined, the worker simply idles. Local and project-rooted deployments were unaffected.
