v0.0.70 — Agent Reviews, Trace Explorer, Chat↔Cobuild
Released: 2026-05-15. Twenty-three commits.
This is the first big Dataiku-parity batch — Agent Reviews (capability #8), Trace Explorer (#15), Agent Versioning (#16), and the start of bidirectional Chat ↔ Cobuild navigation.
Agent Reviews subsystem (#8)
A complete review-and-judge workflow for agents lands across three slices in this release.
Slice 1 — Reviews + Tests CRUD
Schema for the full subsystem (agent_reviews, agent_test_cases, agent_test_runs, agent_test_results, agent_test_traits) ships now so slice 2's runner has somewhere to write; only Reviews + Tests endpoints and UI are wired in this slice.
AgentReviewsPage lists reviews with search, favorite / agent filters, and a "+ New" modal. AgentReviewPage renders a header + 6 tabs; the Tests tab is fully wired (table + drawer for prompt / reference / expectations). The other 5 tabs are explicit placeholders pointing to slice 2.
Slice 2 — Runner + LLM-as-judge + Results
agent_judge runs two judge prompts in parallel — reference semantic-equivalence + expectations rule-check. Temperature 0, skip on empty inputs, neutral on no grading dimension. combine_overall_status() rolls verdicts → pass / fail / error / neutral.
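A rough sketch of how two judge verdicts might roll up into one status. The source names only the function and the four outcomes; the signature and the precedence used here (error over fail over pass, neutral when neither judge graded anything) are assumptions.

```python
from typing import Optional


def combine_overall_status(reference: Optional[str], expectations: Optional[str]) -> str:
    """Roll two judge verdicts into one status (assumed precedence).

    Each argument is "pass", "fail", "error", or None when that judge
    was skipped because its input was empty.
    """
    verdicts = [v for v in (reference, expectations) if v is not None]
    if not verdicts:
        return "neutral"  # no grading dimension was available
    if "error" in verdicts:
        return "error"
    if "fail" in verdicts:
        return "fail"
    return "pass"
```

Under these assumptions a test with only a reference answer is graded on that single dimension, and a test with neither input stays neutral rather than counting as a pass.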
agent_test_runner.run_all_tests fans out N executions per test, snapshots agent_version onto the run row, persists results + traits, and routes agent invocation through generate_knowledge_response so connector / tools / RAG mirror the live chat path. quick_test runs once with no persistence.
Five new endpoints on /api/agent-reviews/{rid}: GET / POST /runs, GET /runs/{id}, GET /runs/{id}/results (single batched traits query — no N+1), POST /tests/{tid}/quick-test. The page gains a Run All button, inline ⚡ Quick Test on the Tests tab, and a new Results tab with run selector, status breakdown bar, per-test Traits matrix, and result drawer.
Slice 3 — Compare / Settings / Logs
The last three placeholder tabs go live:
- Compare: two-run selector joined on (test_id, execution_index); shows pass/fail per side plus a same / regression / improvement / changed chip.
- Settings: name / description / judge_model / executions_per_test / tags via the existing PATCH endpoint.
- Logs: "See trace" button in ResultDrawer opens TraceExplorerDrawer scoped to the test run.
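The Compare join can be pictured roughly as follows. This is a sketch only: the row shape (dicts with test_id, execution_index, status) and the exact chip rules are assumptions, and the helper names are hypothetical.

```python
def classify(before: str, after: str) -> str:
    """Label a pair of statuses (assumed rules for the four chips)."""
    if before == after:
        return "same"
    if before == "pass" and after == "fail":
        return "regression"
    if before == "fail" and after == "pass":
        return "improvement"
    return "changed"  # any other transition, e.g. pass -> error


def compare_runs(left: list, right: list) -> list:
    """Inner-join two runs' results on (test_id, execution_index)."""
    right_by_key = {(r["test_id"], r["execution_index"]): r for r in right}
    rows = []
    for l in left:
        key = (l["test_id"], l["execution_index"])
        r = right_by_key.get(key)
        if r is None:
            continue  # no counterpart in the other run
        rows.append({
            "key": key,
            "left": l["status"],
            "right": r["status"],
            "chip": classify(l["status"], r["status"]),
        })
    return rows
```

Joining on the pair rather than on test_id alone keeps repeated executions of the same test lined up one-to-one.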
This slice also wires the trace plumbing: a new ContextVar in tool_executor (_test_run_ctx) binds test_run_id + execution_index into agent_tool_executions rows during a test run. Live chat is unchanged (NULL when unbound).
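A minimal sketch of the ContextVar pattern described above. Only the name _test_run_ctx comes from the release; the bind_test_run and execution_row helpers and the row shape are hypothetical.

```python
import contextvars
from contextlib import contextmanager

# Unbound by default, which is the live-chat case.
_test_run_ctx = contextvars.ContextVar("_test_run_ctx", default=None)


@contextmanager
def bind_test_run(test_run_id: int, execution_index: int):
    """Bind the test-run identity for the duration of one execution."""
    token = _test_run_ctx.set((test_run_id, execution_index))
    try:
        yield
    finally:
        _test_run_ctx.reset(token)  # restore the previous (unbound) state


def execution_row(tool_name: str) -> dict:
    """Build the fields a tool-execution row would carry (hypothetical shape)."""
    bound = _test_run_ctx.get()
    test_run_id, execution_index = bound if bound else (None, None)
    return {
        "tool_name": tool_name,
        "test_run_id": test_run_id,          # None maps to NULL outside a test run
        "execution_index": execution_index,
    }
```

Because context variables are copied per asyncio task, parallel executions started as separate tasks would each see their own binding without passing extra arguments through the tool layer.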
Trace Explorer (#15)
Walks data already captured in agent_tool_executions and renders it as a tree. New fields land (idempotent ALTER, all nullable): parent_exec_id, span_kind, tokens_in/out, cost_usd, test_run_id, execution_index.
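Turning flat rows into a span tree via parent_exec_id could look roughly like this. The id column name and the dict row shape are assumptions, and rows whose parent is missing from the result set are treated as roots here, which may differ from the real behavior.

```python
from collections import defaultdict


def build_span_tree(rows: list) -> list:
    """Nest flat execution rows under their parents (assumes no cycles)."""
    ids = {row["id"] for row in rows}
    children = defaultdict(list)
    for row in rows:
        parent = row.get("parent_exec_id")
        # Rows with no parent, or with a parent outside this set, become roots.
        children[parent if parent in ids else None].append(row)

    def attach(row: dict) -> dict:
        return {**row, "children": [attach(c) for c in children[row["id"]]]}

    return [attach(root) for root in children[None]]
```

Since parent_exec_id is nullable, rows written before the new column existed simply show up as top-level spans in this sketch.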
/api/traces (router prefix) exposes:
- GET /sessions/{id}: span tree (default) or flat
- GET /test-runs/{run_id}/results/{result_id}: same, scoped to one test execution
- GET /executions/{exec_id}: full payload (no truncation)
TraceExplorerDrawer slides in with kind chips, status dots, latency, expandable args/result blocks. Click-to-expand fetches the full payload. ConversationsPage gets a See trace button in the thread header.
Agent versioning (#16)
published_assets gains an active_version pointer separate from current_version (latest). Agents can be edited (which keeps bumping current_version) without changing what consumers serve. NULL preserves today's behavior; setting N pins serving to that historical asset_versions row.
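The pointer semantics reduce to a one-line rule, sketched here with a hypothetical helper name:

```python
from typing import Optional


def serving_version(active_version: Optional[int], current_version: int) -> int:
    """Pick the version consumers are served.

    NULL (None) keeps the pre-existing always-latest behavior; any other
    value pins serving to that historical version.
    """
    return current_version if active_version is None else active_version
```

For example, an agent edited up to version 9 but pinned at 4 keeps serving 4 until the pointer is cleared or moved.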
New AgentVersionPicker dropdown in the AgentBuilderPage header replaces the static "Version vN" label. Foundation for Agent Review pinning runs to a version and for Agent Hub (Batch G #9) serving the active version rather than always-latest.
Bidirectional Chat ↔ Cobuild
Closes the loop between a dashboard's Build chat and the Cobuild dock — operators can now jump in both directions without losing context.
- GET /api/cobuild/by-asset/{type}/{id}: most recent Cobuild session whose workstream produced this asset. Backs the Spawned by Cobuild pill in the Dashboard Build chat header.
- POST /api/cobuild/escalate-from-dashboard/{dashboard_id}: spins a new Cobuild session seeded from the dashboard's chat transcript. Optional note body field appended verbatim; context_turns clamped to [1, 20].
- GET /api/chats/related/{type}/{id}: escalation tree for an asset (primary chat + ancestor Cobuilds + descendant Cobuilds) in one round trip.
- GET /api/chats/recent[?scope=]: unified view across dashboard_chat_messages, cobuild_messages, and chat_history. Paves the way for a future "option C" unified-history surface.
Frontend:
- SpawnedByCobuildPill extracted as a shared component (block + inline variants): dashboards now, agents and webapps next.
- Cobuild dock rows carry a from <asset_type> chip when source_asset_type is set. The chip is clickable and navigates to the source asset; sourceAssetHref covers dashboard / recipe / agent.
- New RelatedChatsPanel drops into a builder page's right rail and renders the asset's full escalation tree. Wired into DashboardDetailPage below the existing AssetReferencesPanel.
- Recipe pages get the SpawnedByCobuildPill (inline variant) next to the recipe title.
Cross-org gating returns 404 (not 403) on /by-asset/ and /related/ so callers can't probe the asset-id space across tenants.
Dashboards — cross-filter cascade
Backend foundation for click-as-filter on dashboard pages. ExecuteAllRequest now accepts cross_filters: list[{column, value, values}] and an optional cross_filter_source_card_id (excluded from the cascade so the originating card always shows its full slice).
Per-card opt-in via card_config.cross_filter_columns. Cards without the opt-in are untouched. Identifier safety: column names must match _PARAM_NAME_RE fullmatch — any unvetted identifier silently skips the wrap rather than risking injection. IN-list selections supported for multi-pick bar/pie clicks. Frontend wiring is a follow-up.
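A sketch of what the wrap might look like for one card. The regex pattern behind _PARAM_NAME_RE, the subquery-wrapping strategy, the named-parameter style, and the wrap_with_cross_filters helper are all assumptions; the release only states that names must fullmatch the pattern and that unvetted identifiers are skipped. This sketch skips the offending filter rather than the whole wrap.

```python
import re

# Assumed pattern; the real _PARAM_NAME_RE is not shown in the release notes.
_PARAM_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def wrap_with_cross_filters(sql: str, cross_filters: list, opted_in: set):
    """Wrap a card query in a filtered outer SELECT; values travel as bound params."""
    clauses, params = [], {}
    for i, f in enumerate(cross_filters):
        col = f["column"]
        # Skip columns the card did not opt into, and any unvetted identifier.
        if col not in opted_in or not _PARAM_NAME_RE.fullmatch(col):
            continue
        values = f.get("values")
        if values:  # multi-pick selection becomes an IN list
            names = [f"cf_{i}_{j}" for j in range(len(values))]
            params.update(zip(names, values))
            placeholders = ", ".join(":" + n for n in names)
            clauses.append(f'"{col}" IN ({placeholders})')
        else:
            name = f"cf_{i}"
            params[name] = f.get("value")
            clauses.append(f'"{col}" = :{name}')
    if not clauses:
        return sql, {}  # nothing applicable: the card query is untouched
    where = " AND ".join(clauses)
    return f"SELECT * FROM ({sql}) AS cf_base WHERE {where}", params
```

Only the column name is ever interpolated into SQL text, and only after the fullmatch check; filter values never leave the parameter dictionary.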
Wire-filter apply — surface real errors
When _snapshot_dashboard raised mid-execute (e.g. after a partial SQL statement had already run), the AsyncSession was left in a pending-rollback state. The caller's try/except: pass swallowed the exception but did not roll back, so subsequent UPDATEs raised a generic InvalidRequestError, which the frontend collapsed to "Failed to apply". The backend now rolls back the session when the snapshot fails, in both wire-filter apply and apply-suggestions, and a new extractApplyErrorMessage() surfaces the HTTP status, structured 4xx details, and the network-level err.message so the next prod failure is debuggable from the modal alone.
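The failure mode and the fix can be reproduced in miniature. FakeSession below is a stand-in, not SQLAlchemy's AsyncSession, and the function names, the UPDATE text, and the placement of the rollback inside the except block are all illustrative assumptions.

```python
import asyncio


class FakeSession:
    """Stand-in for an async DB session; tracks only the pending-rollback flag."""

    def __init__(self):
        self.needs_rollback = False

    async def rollback(self):
        self.needs_rollback = False

    async def execute(self, statement: str) -> str:
        if self.needs_rollback:
            raise RuntimeError("session is in a pending-rollback state")
        return statement


async def snapshot_dashboard(session: FakeSession):
    session.needs_rollback = True  # simulate a failure partway through a statement
    raise ValueError("snapshot failed")


async def apply_wire_filter(session: FakeSession, roll_back: bool = True) -> str:
    try:
        await snapshot_dashboard(session)
    except Exception:
        if roll_back:
            await session.rollback()  # the fix: clear the broken transaction
    # Without the rollback above, this next statement is the one that fails.
    return await session.execute("UPDATE cards SET filter = :f")
```

Calling apply_wire_filter with roll_back=False reproduces the old behavior: the snapshot error is swallowed and the later UPDATE fails with an unrelated-looking error.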
LDAP sync (RBAC Batch 1)
Per-org LDAP configuration + group-membership sync closes Phase 4 of the RBAC roadmap. groups.ldap_dn has been live since v0.0.55 but had no UI / no sync engine.
New table honeyframe.org_ldap_config (one row per org). Bind password is AES-GCM encrypted using the same key derivation as services.llm_credentials. Sync semantics in services/ldap_sync.py: for each group with ldap_dn IS NOT NULL, query LDAP for members, resolve each via user_id_attribute (default mail), and match against honeyframe.users.email scoped to the caller's org. The sync then reconciles user_groups: insert new memberships, delete removed ones. Local groups (no ldap_dn) are never touched. Users present in LDAP but not in honeyframe are silently skipped; provisioning is out of scope.
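The reconcile step for a single group comes down to two set differences. A sketch, assuming exact-match email lookup and a hypothetical helper name; the LDAP query and the database writes are left out.

```python
def reconcile_membership(ldap_emails: list, users_by_email: dict, current_member_ids: set):
    """Return (user_ids_to_insert, user_ids_to_delete) for one LDAP-backed group.

    users_by_email maps the org's user emails to user ids. LDAP members
    with no matching user are skipped, since provisioning is out of scope.
    """
    desired = {users_by_email[e] for e in ldap_emails if e in users_by_email}
    to_insert = desired - current_member_ids
    to_delete = current_member_ids - desired
    return to_insert, to_delete
```

Members already in both sets are left alone, so a sync with no upstream changes writes nothing.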
Endpoints under /api/admin/ldap-config (org.admin only): GET, PUT, DELETE, POST /test (bind + small search, supports inline config for try-before-save), POST /sync. Frontend Settings page is a follow-up.
AI dataset descriptions
POST /api/datasets/ai-describe sends a dataset's schema + a small sample to the org's configured LLM and returns strict-JSON with a one-line table description plus per-column descriptions. apply=true writes them into honeyframe.datasets in one transaction; default is preview-only so the UI can show a review-then-apply flow (matches Dataiku's "Generate description" UX). Closes Databricks' "AI-generated table comments" parity gap.
Scenarios — python trigger
scenarios.trigger_type='python' lands. The scheduler evaluates a user-supplied expression every tick inside a strict AST sandbox; truthy result fires the scenario subject to the same 60s throttle as cron. Whitelisted builtins only — no imports, lambdas, or dunders. db_query(sql) accepts SELECT-only single statements; 10s wall-clock budget per evaluation. Closes the python-trigger gap in the competitive matrix.
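To illustrate the AST-check idea, here is a much-reduced sketch. It is not the shipped sandbox: the builtin whitelist is invented, db_query and the 10s budget are omitted, and evaluate_trigger is a hypothetical name. A restricted eval like this should not be treated as a security boundary on its own.

```python
import ast

# Invented whitelist for illustration only.
_ALLOWED_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum, "abs": abs}


def evaluate_trigger(expression: str, names: dict = None) -> bool:
    """Validate an expression's AST, then evaluate it with whitelisted builtins."""
    # mode="eval" accepts a single expression, so statements such as
    # `import os` fail here with SyntaxError before any node check runs.
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Lambda):
            raise ValueError("lambdas are not allowed")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            raise ValueError("dunder attributes are not allowed")
    scope = {"__builtins__": _ALLOWED_BUILTINS, **(names or {})}
    return bool(eval(compile(tree, "<trigger>", "eval"), scope, {}))
```

A scheduler tick would call something like evaluate_trigger("len(rows) > 2", {"rows": rows}) and fire the scenario on a truthy result, subject to the throttle described above.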