Version: v0.0.81

v0.0.70 — Agent Reviews, Trace Explorer, Chat↔Cobuild

Released: 2026-05-15. Twenty-three commits.

This is the first big Dataiku-parity batch — Agent Reviews (capability #8), Trace Explorer (#15), Agent Versioning (#16), and the start of bidirectional Chat ↔ Cobuild navigation.

Agent Reviews subsystem (#8)

A complete review-and-judge workflow for agents lands across three slices in this release.

Slice 1 — Reviews + Tests CRUD

Schema for the full subsystem (agent_reviews, agent_test_cases, agent_test_runs, agent_test_results, agent_test_traits) ships now so slice 2's runner has somewhere to write; only Reviews + Tests endpoints and UI are wired in this slice.

AgentReviewsPage lists reviews with search, favorite / agent filters, and a "+ New" modal. AgentReviewPage renders a header + 6 tabs; the Tests tab is fully wired (table + drawer for prompt / reference / expectations). The other 5 tabs are explicit placeholders pointing to slice 2.

Slice 2 — Runner + LLM-as-judge + Results

agent_judge runs two judge prompts in parallel — reference semantic-equivalence + expectations rule-check. Temperature 0, skip on empty inputs, neutral on no grading dimension. combine_overall_status() rolls verdicts → pass / fail / error / neutral.
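The roll-up can be pictured with a minimal sketch. The signature, the verdict strings, and the precedence order (error over fail over pass over neutral) are assumptions for illustration; the notes only name combine_overall_status() and its four outcomes.

```python
from typing import Optional

# Hypothetical verdict values; the real agent_judge may encode them differently.
PASS, FAIL, ERROR, NEUTRAL = "pass", "fail", "error", "neutral"


def combine_overall_status(reference: Optional[str],
                           expectations: Optional[str]) -> str:
    """Roll two judge verdicts into a single status.

    None means that judge was skipped because its input was empty.
    Assumed precedence: error > fail > pass > neutral.
    """
    verdicts = [v for v in (reference, expectations) if v is not None]
    if not verdicts:
        return NEUTRAL  # no grading dimension was available
    for status in (ERROR, FAIL, PASS):
        if status in verdicts:
            return status
    return NEUTRAL
```

Under these assumptions a test with only a reference answer is graded on that dimension alone, and a test with neither reference nor expectations stays neutral.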

agent_test_runner.run_all_tests fans out N executions per test, snapshots agent_version onto the run row, persists results + traits, and routes agent invocation through generate_knowledge_response so connector / tools / RAG mirror the live chat path. quick_test runs once with no persistence.
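A rough sketch of the fan-out follows. It assumes concurrent execution via asyncio, a 0-based execution_index, and a hypothetical invoke callable standing in for generate_knowledge_response; persistence and version snapshotting are left out.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_all_tests(
    test_ids: List[int],
    executions_per_test: int,
    invoke: Callable[[int], Awaitable[str]],  # hypothetical agent call
) -> List[Dict[str, Any]]:
    """Run every (test, execution_index) pair and collect one row per run."""

    async def run_one(test_id: int, index: int) -> Dict[str, Any]:
        row = {"test_id": test_id, "execution_index": index,
               "output": None, "error": None}
        try:
            row["output"] = await invoke(test_id)
        except Exception as exc:  # a failed run becomes an error row
            row["error"] = str(exc)
        return row

    jobs = [run_one(t, i) for t in test_ids for i in range(executions_per_test)]
    # gather() returns results in the same order the jobs were submitted.
    return await asyncio.gather(*jobs)
```

Capturing exceptions per execution keeps one failing invocation from aborting the rest of the run.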

Five new endpoints on /api/agent-reviews/{rid}: GET / POST /runs, GET /runs/{id}, GET /runs/{id}/results (single batched traits query — no N+1), POST /tests/{tid}/quick-test. The page gains a Run All button, inline ⚡ Quick Test on the Tests tab, and a new Results tab with run selector, status breakdown bar, per-test Traits matrix, and result drawer.

Slice 3 — Compare / Settings / Logs

The last three placeholder tabs go live:

Compare — two-run selector joined on (test_id, execution_index), shows pass/fail per side + same / regression / improvement / changed chip.
Settings — name / description / judge_model / executions_per_test / tags via the existing PATCH endpoint.
Logs — "See trace" button in ResultDrawer opens TraceExplorerDrawer scoped to the test run.
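The Compare join can be sketched as below. The row shape, the inner-join behavior, and the exact chip rules (pass to fail is a regression, fail to pass is an improvement, any other difference is "changed") are assumptions; the notes only state the join key and the four chip names.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (test_id, execution_index)


def classify_pair(left: str, right: str) -> str:
    """Chip for one joined pair; left is the baseline run, right the candidate."""
    if left == right:
        return "same"
    if left == "pass" and right == "fail":
        return "regression"
    if left == "fail" and right == "pass":
        return "improvement"
    return "changed"


def compare_runs(run_a: List[dict], run_b: List[dict]) -> List[dict]:
    """Inner-join two result sets on (test_id, execution_index)."""
    by_key: Dict[Key, str] = {
        (r["test_id"], r["execution_index"]): r["status"] for r in run_b
    }
    rows = []
    for r in run_a:
        key = (r["test_id"], r["execution_index"])
        if key in by_key:
            rows.append({"key": key, "a": r["status"], "b": by_key[key],
                         "chip": classify_pair(r["status"], by_key[key])})
    return rows
```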

This slice also wires the trace plumbing: a new ContextVar in tool_executor (_test_run_ctx) binds test_run_id + execution_index into agent_tool_executions rows during a test run. Live chat is unchanged (both columns stay NULL when the ContextVar is unbound).
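A minimal sketch of the ContextVar pattern, using only the standard library. The bind_test_run and execution_row helpers are hypothetical names for illustration; only _test_run_ctx and the two bound fields come from the notes.

```python
from contextlib import contextmanager
from contextvars import ContextVar
from typing import Iterator, Optional, Tuple

# Holds (test_run_id, execution_index) during a test run, None otherwise.
_test_run_ctx: ContextVar[Optional[Tuple[int, int]]] = ContextVar(
    "_test_run_ctx", default=None
)


@contextmanager
def bind_test_run(test_run_id: int, execution_index: int) -> Iterator[None]:
    """Bind the test-run identifiers for the duration of the with-block."""
    token = _test_run_ctx.set((test_run_id, execution_index))
    try:
        yield
    finally:
        _test_run_ctx.reset(token)  # restore the previous (unbound) value


def execution_row(tool_name: str) -> dict:
    """Build the columns a tool execution would persist (hypothetical shape)."""
    bound = _test_run_ctx.get()
    run_id, index = bound if bound is not None else (None, None)
    return {"tool_name": tool_name, "test_run_id": run_id,
            "execution_index": index}
```

Because the value lives in a ContextVar rather than a function argument, the tool executor needs no signature change, and each asyncio task sees its own binding.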

Trace Explorer (#15)

Walks data already captured in agent_tool_executions and renders it as a tree. New fields land (idempotent ALTER, all nullable): parent_exec_id, span_kind, tokens_in/out, cost_usd, test_run_id, execution_index.
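Turning flat rows into a tree via parent_exec_id might look like the sketch below. The exec_id key name and the choice to treat rows with a missing parent as roots are assumptions, not confirmed behavior.

```python
from collections import defaultdict
from typing import Any, Dict, List


def build_span_tree(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Nest flat execution rows under their parents.

    Rows whose parent_exec_id is NULL, or points outside this result set,
    are treated as roots (an assumption for this sketch).
    """
    known_ids = {r["exec_id"] for r in rows}
    children: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for r in rows:
        parent = r.get("parent_exec_id")
        children[parent if parent in known_ids else None].append(r)

    def attach(row: Dict[str, Any]) -> Dict[str, Any]:
        # Copy the row and add its nested children in original row order.
        return {**row, "children": [attach(c) for c in children[row["exec_id"]]]}

    return [attach(r) for r in children[None]]
```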

/api/traces (router prefix) exposes:

GET /sessions/{id} — span tree (default) or flat
GET /test-runs/{run_id}/results/{result_id} — same, scoped to one test execution
GET /executions/{exec_id} — full payload (no truncation)

TraceExplorerDrawer slides in with kind chips, status dots, latency, expandable args/result blocks. Click-to-expand fetches the full payload. ConversationsPage gets a See trace button in the thread header.

Agent versioning (#16)

published_assets gains an active_version pointer separate from current_version (latest). Agents can be edited (which keeps bumping current_version) without changing what consumers serve. NULL preserves today's behavior; setting N pins serving to that historical asset_versions row.
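The pointer semantics reduce to a one-line rule, sketched here with a hypothetical helper name:

```python
from typing import Optional


def serving_version(current_version: int, active_version: Optional[int]) -> int:
    """Version that consumers are served.

    NULL (None) active_version keeps the always-latest behavior;
    any other value pins serving to that historical version.
    """
    return current_version if active_version is None else active_version
```

For example, an agent edited up to version 7 with active_version set to 4 keeps serving version 4 until the pointer moves.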

New AgentVersionPicker dropdown in the AgentBuilderPage header replaces the static "Version vN" label. Foundation for Agent Review pinning runs to a version and for Agent Hub (Batch G #9) serving the active version rather than always-latest.

Bidirectional Chat ↔ Cobuild

Closes the loop between a dashboard's Build chat and the Cobuild dock — operators can now jump in both directions without losing context.

GET /api/cobuild/by-asset/{type}/{id} — most recent Cobuild session whose workstream produced this asset. Backs the Spawned by Cobuild pill in the Dashboard Build chat header.
POST /api/cobuild/escalate-from-dashboard/{dashboard_id} — spins a new Cobuild session seeded from the dashboard's chat transcript. Optional note body field appended verbatim; context_turns clamped to [1, 20].
GET /api/chats/related/{type}/{id} — escalation tree for an asset (primary chat + ancestor Cobuilds + descendant Cobuilds) in one round trip.
GET /api/chats/recent[?scope=] — unified view across dashboard_chat_messages, cobuild_messages, and chat_history. Paves the way for a future "option C" unified-history surface.

Frontend:

SpawnedByCobuildPill extracted as a shared component (block + inline variants) — dashboards now, agents and webapps next.
Cobuild dock rows carry a from <asset_type> chip when source_asset_type is set. The chip is clickable and navigates to the source asset; sourceAssetHref covers dashboard / recipe / agent.
New RelatedChatsPanel drops into a builder page's right rail and renders the asset's full escalation tree. Wired into DashboardDetailPage below the existing AssetReferencesPanel.
Recipe pages get the SpawnedByCobuildPill (inline variant) next to the recipe title.

Cross-org gating returns 404 (not 403) on /by-asset/ and /related/ so callers can't probe the asset-id space across tenants.
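The gating idea, sketched framework-free. The function name, the asset dict shape, and the response body are illustrative assumptions; the point is that a missing asset and another tenant's asset produce the same answer.

```python
from typing import Optional, Tuple


def gate_asset(asset: Optional[dict], caller_org_id: int) -> Tuple[int, dict]:
    """Return (status, body) for an asset lookup.

    A foreign-org asset is reported exactly like a nonexistent one, so the
    response never confirms that an id exists in another tenant.
    """
    if asset is None or asset["org_id"] != caller_org_id:
        return 404, {"detail": "Not found"}
    return 200, asset
```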

Dashboards — cross-filter cascade

Backend foundation for click-as-filter on dashboard pages. ExecuteAllRequest now accepts cross_filters: list[{column, value, values}] and an optional cross_filter_source_card_id (excluded from the cascade so the originating card always shows its full slice).

Per-card opt-in via card_config.cross_filter_columns. Cards without the opt-in are untouched. Identifier safety: column names must match _PARAM_NAME_RE fullmatch — any unvetted identifier silently skips the wrap rather than risking injection. IN-list selections supported for multi-pick bar/pie clicks. Frontend wiring is a follow-up.
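A sketch of the wrap under stated assumptions: the regex shown for _PARAM_NAME_RE is a guess at a typical identifier pattern, the function name and the outer-SELECT form are hypothetical, and this version skips only the offending filter rather than the whole wrap.

```python
import re
from typing import Any, Dict, List, Set, Tuple

# Assumed pattern; the real _PARAM_NAME_RE is not shown in these notes.
_PARAM_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def wrap_with_cross_filters(
    card_sql: str,
    cross_filters: List[Dict[str, Any]],
    opted_in_columns: Set[str],
) -> Tuple[str, Dict[str, Any]]:
    """Wrap a card query in an outer WHERE built from vetted cross-filters."""
    clauses: List[str] = []
    params: Dict[str, Any] = {}
    for i, flt in enumerate(cross_filters):
        col = flt.get("column", "")
        if col not in opted_in_columns or not _PARAM_NAME_RE.fullmatch(col):
            continue  # unvetted identifier: skip silently, never interpolate
        values = flt.get("values")
        if values:  # multi-pick selection becomes an IN list
            names = [f"cf_{i}_{j}" for j in range(len(values))]
            params.update(zip(names, values))
            placeholders = ", ".join(f":{n}" for n in names)
            clauses.append(f'"{col}" IN ({placeholders})')
        else:
            params[f"cf_{i}"] = flt.get("value")
            clauses.append(f'"{col}" = :cf_{i}')
    if not clauses:
        return card_sql, params
    where = " AND ".join(clauses)
    return f"SELECT * FROM ({card_sql}) AS cf WHERE {where}", params
```

Values always travel as bound parameters; only the column name is interpolated, which is why it must pass the fullmatch check first.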

Wire-filter apply — surface real errors

When _snapshot_dashboard raised mid-execute (e.g. a partial SQL statement had already run before the failure), the AsyncSession was left in a pending-rollback state. The caller's try/except: pass swallowed the exception but did NOT roll back, so subsequent UPDATEs raised a generic InvalidRequestError, which the frontend collapsed to "Failed to apply". The backend now rolls back in the snapshot's except block in both wire-filter apply and apply-suggestions, and a new extractApplyErrorMessage() surfaces HTTP status, structured 4xx details, and network-level err.message so the next prod failure is debuggable from the modal alone.
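The failure mode and the fix can be reproduced with a stand-in session. FakeSession below is a toy that imitates the pending-rollback behavior described here, not SQLAlchemy's AsyncSession, and apply_filters is a hypothetical name.

```python
import asyncio


class FakeSession:
    """Toy session: after a failed statement it refuses work until rollback()."""

    def __init__(self) -> None:
        self.needs_rollback = False

    async def execute(self, sql: str) -> str:
        if self.needs_rollback:
            raise RuntimeError("session is in a pending-rollback state")
        if sql.startswith("BAD"):
            self.needs_rollback = True
            raise ValueError("statement failed")
        return sql

    async def rollback(self) -> None:
        self.needs_rollback = False


async def apply_filters(session: FakeSession) -> str:
    try:
        await session.execute("BAD snapshot")  # best-effort snapshot step
    except Exception:
        # The fix: swallowing the error is fine, but the session must be
        # rolled back before any further statement is issued.
        await session.rollback()
    return await session.execute("UPDATE dashboards SET filters = :f")
```

Removing the rollback() call makes the second execute() raise the generic pending-rollback error instead of the original cause, which mirrors the symptom described above.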

LDAP sync (RBAC Batch 1)

Per-org LDAP configuration + group-membership sync closes Phase 4 of the RBAC roadmap. groups.ldap_dn has been live since v0.0.55 but had no UI / no sync engine.

New table honeyframe.org_ldap_config (one row per org). Bind password is AES-GCM encrypted using the same key derivation as services.llm_credentials. Sync semantics in services/ldap_sync.py: for each group with ldap_dn IS NOT NULL, query LDAP for members, resolve each member via user_id_attribute (default mail), and match against honeyframe.users.email scoped to the caller's org. The sync then reconciles user_groups: insert new memberships, delete removed ones. Local groups (no ldap_dn) are never touched. Users present in LDAP but not in honeyframe are silently skipped; provisioning is out of scope.
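The reconcile step for a single LDAP-backed group is a set difference, sketched below. The function name and argument shapes are hypothetical, and the case-insensitive email match is an assumption.

```python
from typing import Dict, Iterable, Set, Tuple


def reconcile_group(
    ldap_member_emails: Iterable[str],
    org_users_by_email: Dict[str, int],  # lowercased email -> user id, one org
    current_member_ids: Set[int],
) -> Tuple[Set[int], Set[int]]:
    """Return (to_insert, to_delete) user ids for one group."""
    desired = {
        org_users_by_email[email.lower()]
        for email in ldap_member_emails
        if email.lower() in org_users_by_email  # unknown LDAP users are skipped
    }
    return desired - current_member_ids, current_member_ids - desired
```

Running it twice with the same inputs yields empty sets the second time, so the sync is idempotent.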

Endpoints under /api/admin/ldap-config (org.admin only): GET, PUT, DELETE, POST /test (bind + small search, supports inline config for try-before-save), POST /sync. Frontend Settings page is a follow-up.

AI dataset descriptions

POST /api/datasets/ai-describe sends a dataset's schema + a small sample to the org's configured LLM and returns strict-JSON with a one-line table description plus per-column descriptions. apply=true writes them into honeyframe.datasets in one transaction; default is preview-only so the UI can show a review-then-apply flow (matches Dataiku's "Generate description" UX). Closes Databricks' "AI-generated table comments" parity gap.
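Validating the model's reply before previewing or applying it could look like this sketch. The JSON shape (a description string plus a columns object keyed by column name) and the function name are assumptions; the notes only say the response is strict JSON with a table description and per-column descriptions.

```python
import json
from typing import Dict, List


def parse_description(raw: str, column_names: List[str]) -> Dict[str, object]:
    """Parse and validate an LLM reply, keeping only known columns."""
    data = json.loads(raw)  # raises json.JSONDecodeError on non-JSON output
    if not isinstance(data, dict) or not isinstance(data.get("description"), str):
        raise ValueError("missing table description")
    columns = data.get("columns")
    if not isinstance(columns, dict):
        raise ValueError("missing per-column descriptions")
    return {
        "description": data["description"].strip(),
        # Drop descriptions for columns the dataset does not have.
        "columns": {c: columns[c] for c in column_names
                    if isinstance(columns.get(c), str)},
    }
```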

Scenarios — python trigger

scenarios.trigger_type='python' lands. The scheduler evaluates a user-supplied expression every tick inside a strict AST sandbox; truthy result fires the scenario subject to the same 60s throttle as cron. Whitelisted builtins only — no imports, lambdas, or dunders. db_query(sql) accepts SELECT-only single statements; 10s wall-clock budget per evaluation. Closes the python-trigger gap in the competitive matrix.
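A minimal illustration of AST-based expression vetting, assuming a node whitelist and a small builtin set chosen for this example. It omits db_query and the wall-clock budget, and it is a sketch of the technique rather than a hardened sandbox or the scheduler's actual code.

```python
import ast
from typing import Any, Dict, Optional

# Assumed whitelist of builtins for the sketch.
_ALLOWED_BUILTINS = {"len": len, "min": min, "max": max, "sum": sum,
                     "abs": abs, "any": any, "all": all}

# Only these node types may appear; attribute access, lambdas,
# comprehensions, and keyword arguments are all rejected.
_ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.BinOp, ast.UnaryOp, ast.Compare,
    ast.Call, ast.Name, ast.Constant, ast.List, ast.Tuple, ast.Load,
    ast.boolop, ast.operator, ast.unaryop, ast.cmpop,
)


def safe_eval(expr: str, names: Optional[Dict[str, Any]] = None) -> bool:
    """Evaluate a trigger expression and return its truthiness."""
    tree = ast.parse(expr, mode="eval")  # statements such as import fail here
    for node in ast.walk(tree):
        if not isinstance(node, _ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            raise ValueError("dunder names are not allowed")
    env = {"__builtins__": {}, **_ALLOWED_BUILTINS, **(names or {})}
    return bool(eval(compile(tree, "<trigger>", "eval"), env))
```

A whitelist fails closed: syntax the sketch never anticipated is rejected by default instead of slipping past a blocklist.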
