New Features
Metric Tree and Anomaly Monitoring
A new semantic-layer observability stack built around the metric tree: explore how a top-line metric decomposes into its drivers, watch any measure for anomalies over time, and let the analytics agent run root-cause analysis — all from a dedicated Semantic Layer tab in the Developer Portal.
- Metric Tree Explorer — A new Semantic Layer tab in the IDE consolidates the semantic explorer and metric tree behind one sidebar entry. Pick a measure and time dimension and explore the tree: see how a metric breaks down into its driver metrics, run sensitivity analysis to find which inputs move the number most, ask what-if questions, and surface the biggest opportunities to investigate.
- Anomaly Detection — Define monitors in a .monitor.yml file to watch any measure over time. The detector combines seasonal-trend decomposition (MSTL) with automatic forecasting (AutoETS), supports per-segment detection by fanning out across filters and group_by, and applies a direction filter (increase, decrease, or both) to suppress false positives. Each scan window excludes the current incomplete period so a partial day or week never reads as a drop.
- Insights Inbox — Detected anomalies land in an inbox surface where each one opens an explain drawer with a root-cause breakdown and a follow-up button to take the investigation into a full analytics conversation.
- AI Root-Cause Analysis — The analytics agent gains anomaly tools — it can list outstanding anomalies, run detection on demand, and explain a specific anomaly — woven into its existing root-cause reasoning loop, so you can ask “what’s anomalous in revenue this week, and why?” directly in chat.
- Scheduled Monitor Scans — Monitor scans run on a first-class cron schedule rather than a fixed hourly tick. A schedule: block in .monitor.yml sets per-granularity cron expressions (daily, weekly, monthly), and each schedule appears in the Schedules settings panel with an amber Monitor job badge and a read-only granularity chip.
OXY_INPROC_GLOBAL_WORKER=1).
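The two false-positive guards described for Anomaly Detection (skipping the current incomplete period and filtering by direction) can be sketched as follows. This is a minimal illustration, not Oxy's detector: the function names are hypothetical, weeks are assumed to start on Monday, and the forecasting step (MSTL, AutoETS) is left out entirely.

```python
from datetime import datetime, timedelta


def last_complete_period_end(now: datetime, granularity: str) -> datetime:
    """Return the exclusive end of the most recent fully elapsed period."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "daily":
        return midnight  # today is still partial, so stop at midnight
    if granularity == "weekly":
        # Assumption: weeks start on Monday (weekday() == 0).
        return midnight - timedelta(days=midnight.weekday())
    if granularity == "monthly":
        return midnight.replace(day=1)  # first day of the current month
    raise ValueError(f"unknown granularity: {granularity}")


def passes_direction(observed: float, expected: float, direction: str) -> bool:
    """Keep only deviations that move in the configured direction."""
    if direction == "both":
        return observed != expected
    if direction == "increase":
        return observed > expected
    if direction == "decrease":
        return observed < expected
    raise ValueError(f"unknown direction: {direction}")
```

A scan would only consider data points that end before the value returned by last_complete_period_end, so a half-finished day is never compared against a full-day forecast.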
AWS RDS IAM Authentication for Workers
Oxy worker processes can now authenticate to their PostgreSQL application database using AWS RDS IAM credentials instead of a static password — a more secure option for cloud deployments where long-lived database passwords are undesirable.
- Two Authentication Modes — A new OXY_DATABASE_AUTH_MODE environment variable selects between password (the default, using OXY_DATABASE_URL) and iam. IAM mode builds the connection from discrete OXY_DATABASE_HOST, OXY_DATABASE_NAME, OXY_DATABASE_USER, and OXY_DATABASE_REGION variables and uses temporary IAM credentials at connect time — no database password to store or rotate.
- Fail-Fast Validation — A misconfigured IAM setup is caught before the first connection attempt, with a clear error naming the specific missing variable instead of an opaque connection failure later.
- Safe Credential Logging — Database identifiers are masked in logs for both modes — passwords are stripped in password mode and IAM mode logs only host/database — so connection details never leak into worker logs.
OXY_DATABASE_PORT and OXY_DATABASE_SSL_MODE settings default to RDS standards (5432 and require).
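The fail-fast check can be pictured as a small validation pass over the environment before any connection is opened. The sketch below uses the variable names from these notes; the function name and error wording are hypothetical, and the real worker may validate more than this.

```python
REQUIRED_IAM_VARS = (
    "OXY_DATABASE_HOST",
    "OXY_DATABASE_NAME",
    "OXY_DATABASE_USER",
    "OXY_DATABASE_REGION",
)


def validate_database_env(env: dict) -> str:
    """Return the selected auth mode, or raise naming the first missing variable."""
    mode = env.get("OXY_DATABASE_AUTH_MODE", "password")  # password is the default
    if mode == "password":
        return mode  # password mode reads OXY_DATABASE_URL; not checked in this sketch
    if mode == "iam":
        for name in REQUIRED_IAM_VARS:
            if not env.get(name):
                # Hypothetical message; the point is that it names the variable.
                raise ValueError(f"{name} is required when OXY_DATABASE_AUTH_MODE=iam")
        return mode
    raise ValueError(f"unsupported OXY_DATABASE_AUTH_MODE: {mode}")
```

Called with dict(os.environ) at startup, a check like this turns a missing region into an immediate, named error rather than a connection failure later on.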
Tunable ClickHouse Sources for Airway ELT Pipelines
Airway ELT pipelines reading from ClickHouse now expose the timeout and batching knobs needed to keep long streaming extracts from timing out against slow destinations — no more SOCKET_TIMEOUT aborts surfacing as opaque “error decoding response body” failures mid-run.
- Configurable Timeouts — New connect_timeout_secs and read_timeout_secs options on a clickhouse source replace the previously hardcoded client timeouts, so extracts that back-pressure on a slow destination no longer get cut off prematurely.
- Pass-Through ClickHouse Settings — A settings map forwards arbitrary ClickHouse settings (such as http_send_timeout, send_timeout, or max_execution_time) as query parameters, giving pipeline authors server-side control over socket timeouts without touching the warehouse config.
- Adjustable Batch Size — A batch_size option controls how many rows are written to the destination per commit (previously a fixed 10k). Smaller batches drain faster and keep each read pause well under the timeout window, an easy client-side lever when a destination is the bottleneck.
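To see why a smaller batch_size shortens each pause on the read side, it helps to picture the extract as a stream that is cut into fixed-size chunks, with one destination commit per chunk. The helper below is a generic sketch of that chunking, not Airway's code; the function name is hypothetical and the default mirrors the previous fixed 10k.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_rows(rows: Iterable[dict], batch_size: int = 10_000) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size rows from a streaming source."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))  # pulls rows lazily from the source
        if not batch:
            return
        yield batch  # the caller commits this batch before the next read resumes
```

Because the source is only read again after the current batch has been handed off, the time spent committing one batch is the time the source connection sits idle. Fewer rows per batch means a shorter idle gap.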
Admin Operator Cockpit
The admin surface is rebuilt into a denser, signal-first cockpit for platform operators, and adds new cross-tenant capabilities for understanding spend, managing tenants, and debugging customer issues.
- Internal Jobs Ops Console — The internal jobs view becomes an operations console: charts shrink to a compact health ribbon while a dense, expandable jobs table takes center stage. Each failed or dead job drills into a debug panel showing the full error, the decoded task spec, and the exact workspace, organization, and user it belongs to — with a free-text filter across every field. Dead-letter jobs can be re-enqueued or deleted directly from the table.
- Human-Readable Worker Identity — Workers now appear as {env}·{host-or-pod}·{short} instead of an opaque worker-<uuid>, so a job can be traced back to the infrastructure that ran it at a glance.
- Tenants Management Console — A unified /admin/tenants hub replaces three siloed slide-out sheets with stat cards, a “needs attention” triage feed (stale orgs, orgless users, failed or orphaned workspaces), and recently-created rosters. Full-page detail views for orgs, users, and workspaces are cross-linked, so operators can traverse the User ↔ Org ↔ Workspace relationship without backing out to a list and re-searching at every hop.
- LLM Cost Dashboard — A new spend view on the tenants page sums token usage and computes per-model dollar cost over a 7-, 30-, or 90-day window, surfacing a total, a daily trend, a by-model breakdown, and the top accounts by spend.
- Cross-Tenant Explorer — A new explorer page — and a Threads group in the ⌘K palette — lets operators search threads and agentic runs across every tenant, enriched with workspace, org, and user, to debug a customer issue without leaving the admin surface.
- Universal Admin Search — A ⌘K command palette in the admin topbar searches orgs, users, and workspaces simultaneously, grouped by type, for instant navigation to any tenant entity.
- Workspace Navigation Fix — A global operator opening a workspace from an org they don’t belong to is no longer bounced back home; navigation now lands directly on the workspace.
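The arithmetic behind the LLM Cost Dashboard's by-model breakdown is a windowed sum of tokens multiplied by a per-model price. The sketch below shows that calculation under stated assumptions: the record layout, the function name, and the prices are all placeholders invented for illustration and do not reflect Oxy's schema or any vendor's pricing.

```python
from collections import defaultdict
from datetime import date, timedelta

# Placeholder prices in dollars per million (input, output) tokens.
# These are made-up values for the example, not real pricing.
PRICE_PER_MILLION = {"model-a": (1.0, 2.0), "model-b": (0.5, 1.5)}


def spend_by_model(usage, today: date, window_days: int) -> dict:
    """Sum dollar cost per model for usage rows inside the trailing window.

    usage is an iterable of (day, model, input_tokens, output_tokens) tuples.
    """
    start = today - timedelta(days=window_days)
    totals = defaultdict(float)
    for day, model, tokens_in, tokens_out in usage:
        if not (start < day <= today):
            continue  # outside the 7-, 30-, or 90-day window
        price_in, price_out = PRICE_PER_MILLION[model]
        totals[model] += (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return dict(totals)
```

The dashboard total is then the sum of the returned values, and grouping by day or by account instead of by model gives the trend and top-accounts views.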
Standalone Worker Fleet
Oxy’s background task processing can now run as its own horizontally-scalable fleet, separate from the API server, as the first phase of multi-instance scaling.
- oxy worker Subcommand — Runs the durable task orchestrator as a standalone process with no HTTP server, suitable for independent scaling and rolling deploys. It honors OXY_WORKER_MAX_INFLIGHT, OXY_WORKER_RECOVERY_INTERVAL_SECS, and OXY_WORKER_HEALTH_PORT, shuts down gracefully on SIGTERM/SIGINT with a 30-second drain, and tags every log line and trace span with a stable worker identity.
- HTTP-Only Mode — oxy serve --no-workers (or OXY_DISABLE_INPROCESS_WORKERS=1) runs the API server without in-process workers, for deployments that run a separate worker fleet. This is opt-in — the default single-process deployment is unchanged.
- Kubernetes Health Probes — oxy worker --health-port <port> exposes /healthz and /readyz on a dedicated lightweight router for liveness and readiness checks.
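A graceful drain of the kind described for oxy worker generally means: stop accepting new work, give in-flight tasks a bounded amount of time to finish, then cancel whatever is left. The asyncio sketch below illustrates that pattern only. It is not the worker's implementation (the function name is hypothetical), and the signal wiring is omitted.

```python
import asyncio


async def drain(tasks, timeout_secs: float = 30.0):
    """Wait up to timeout_secs for in-flight tasks, then cancel stragglers.

    Returns a (finished, cancelled) pair of counts.
    """
    if not tasks:
        return 0, 0  # asyncio.wait rejects an empty collection
    done, pending = await asyncio.wait(tasks, timeout=timeout_secs)
    for task in pending:
        task.cancel()  # ran past the drain window
    if pending:
        # Let the cancellations settle so no task is left half-torn-down.
        await asyncio.gather(*pending, return_exceptions=True)
    return len(done), len(pending)
```

On Unix, a process would typically call a routine like this from handlers registered with loop.add_signal_handler for SIGTERM and SIGINT.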
Inline Environment Variables in Airway Pipelines
oxy airway run now substitutes the host process’s environment variables — including a .env file loaded from the working directory — into .airway.yml templates at load time, so a pipeline can reference a secret inline with {{ MY_API_KEY }} instead of routing every credential through the secret manager.
- Inline {{ VAR }} Substitution — Any environment variable is available in the minijinja template context when the YAML loads, so source configuration (API tokens, hosts, database names) can be templated directly in the pipeline file.
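The effect of load-time substitution can be approximated with a few lines of Python. This is a stand-in built on a regular expression, not minijinja, and it handles only bare {{ NAME }} placeholders. Raising on an undefined variable is a choice made for this sketch and may differ from how Airway treats missing variables.

```python
import re

# Matches {{ NAME }} with optional inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")


def render_env(template: str, env: dict) -> str:
    """Replace each {{ NAME }} placeholder with the value of env[NAME]."""

    def replace(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"undefined variable in template: {name}")
        return env[name]

    return _PLACEHOLDER.sub(replace, template)
```

Passing dict(os.environ) as env mirrors the idea that any variable visible to the host process can be referenced from the pipeline file.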
Platform Improvements
Procedures
- Workspace-relative SQL file resolution — The analytics agent now correctly resolves SQL procedure files referenced by paths relative to the workspace root, so procedures that point at SQL files by their workspace path load and execute reliably regardless of where the workspace lives on disk.
API and Documentation
- Richer API reference — The OpenAPI specification served by the API now documents both API Key and Bearer (JWT) authentication — how to obtain each credential, which header to use, and how to authorize directly from the Swagger UI. A new CLI quick-start covers the oxy login and oxy api commands, including the authentication flow, common request patterns, and target-environment configuration, with a link out to the full documentation at docs.oxy.tech.
Security and Authentication
- API key revocation and expiration now enforced — Validating an API key now rejects revoked (inactive) and expired keys at the authentication layer. Revoking a key takes effect immediately, and keys with an expiration date stop working once they pass it — closing a gap where a revoked or expired key could still authenticate.
- Global Owner vs. Global Admin clarified — The platform-wide admin role is now split into two distinct levels: Global Owner (full control, including billing and managing other Global Admins) and Global Admin (operational access to admin tooling, including the new Internal Jobs and Tenants consoles). Billing and global-admin-management routes now strictly require Global Owner. The previous OXY_APP_ADMINS environment variable is renamed to OXY_GLOBAL_ADMINS, with the old name still read as a deprecated fallback.
- Closed a billing permission gap — A Global Admin’s automatically-granted organization membership could previously satisfy organization-admin checks, including on billing endpoints (Stripe portal, invoices). Billing and member-management handlers now reject this synthetic membership, so only real organization admins and Global Owners can access billing.
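The API key fix above amounts to two extra checks at authentication time: is the key still active, and has its expiration passed? A minimal sketch of that rule follows. The ApiKey fields and function name are hypothetical, and treating the exact expiry instant as already expired is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ApiKey:
    is_active: bool  # False once the key has been revoked
    expires_at: Optional[datetime] = None  # None means the key never expires


def is_key_usable(key: ApiKey, now: datetime) -> bool:
    """Reject revoked keys and keys whose expiration has passed."""
    if not key.is_active:
        return False
    if key.expires_at is not None and now >= key.expires_at:
        return False
    return True
```

Running this check on every request, rather than only when the key is issued, is what makes a revocation take effect immediately.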
Logging
- Quieter debug logs — Running with OXY_DEBUG=true no longer floods the output with raw SQL statements and TLS handshake chatter from underlying framework crates, so your own application logs stay readable. Setting RUST_LOG still overrides this for full verbosity when you need the raw firehose.
Airway and Airhouse Pipelines
- Credentials refreshed on every reconnect — Long-running Airhouse ELT loads no longer fail partway through when their short-lived credential expires. The pipeline mints a fresh credential on every reconnect, so a load that runs past the credential’s 15-minute TTL reconnects cleanly instead of failing with an “Airhouse pgwire connect error” mid-run.
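The credential fix follows a common pattern: never cache a short-lived token across connection attempts, and instead mint a new one each time a connection is opened. The sketch below shows the pattern with injected callables. Both callables and the function name are hypothetical stand-ins, not Airhouse APIs.

```python
import time


def connect_with_fresh_credential(mint_credential, open_connection,
                                  attempts: int = 3, backoff_secs: float = 0.0):
    """Open a connection, minting a new credential for every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        credential = mint_credential()  # never reuse a token from an earlier attempt
        try:
            return open_connection(credential)
        except ConnectionError as err:
            last_error = err
            time.sleep(backoff_secs * (attempt + 1))  # simple linear backoff
    raise last_error
```

With this shape, a load that outlives one token's TTL reconnects with a valid token because the expired one is never presented again.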
Reliability
- Telemetry no longer stalls when the Airhouse backend restarts — The observability backend’s long-lived connection to Airhouse now uses TCP keepalive, so when the Airhouse data plane is rolled or evicted a dead peer is detected in about a minute and the connection fails over promptly — instead of silently stalling span, intent, and metric writes for up to two hours.
- Faster task wake-ups on TLS-secured databases — The background task router’s live notification connection now honors OXY_DATABASE_SSL_MODE the same way the main connection pool does. On deployments using managed Postgres with a private CA (such as AWS RDS or self-signed in-cluster databases), the router’s TLS handshake had been failing and silently falling back to slower polling, with a router.reconnect warning every five seconds. The listener now connects cleanly, restoring low-latency task wake-ups and ending the reconnect log noise. Operators who need strict certificate validation can still set OXY_DATABASE_SSL_MODE=verify-full.
- Server startup no longer crash-loops without a project checkout — Cloud oxy serve deployments that run as API servers with no project checkout in their working directory could fail to start, crash-looping with a Failed to read config from file: No such file or directory error introduced in 0.5.70. The pre-aggregation worker now falls back to a default configuration when no config.yml is present at the startup path, so these deployments start cleanly again — with no pre-aggregations defined, the worker simply idles. Local and project-rooted deployments were unaffected.
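The telemetry fix above relies on TCP keepalive to notice a dead peer on an otherwise idle connection. As a rough illustration of how the detection time comes together, the sketch below enables keepalive on a Python socket. The timing values are assumptions chosen to total about a minute, not Oxy's settings, and the finer-grained options are applied only where the platform exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_secs: int = 30,
                     interval_secs: int = 10, probes: int = 3) -> int:
    """Turn on TCP keepalive and return the approximate dead-peer detection time."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket tuning options are platform-specific, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # Idle time before the first probe, plus one interval per unanswered probe.
    return idle_secs + interval_secs * probes
```

Without keepalive, an idle connection to a peer that vanished can sit open until the operating system's much longer defaults apply, which is consistent with the multi-hour stall described above.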