Version: v0.0.81

v0.0.70 — Agent Reviews, Trace Explorer, Chat↔Cobuild

Released: 2026-05-15. Twenty-three commits.

This is the first big Dataiku-parity batch — Agent Reviews (capability #8), Trace Explorer (#15), Agent Versioning (#16), and the start of bidirectional Chat ↔ Cobuild navigation.

Agent Reviews subsystem (#8)

A complete review-and-judge workflow for agents lands across three slices in this release.

Slice 1 — Reviews + Tests CRUD

Schema for the full subsystem (agent_reviews, agent_test_cases, agent_test_runs, agent_test_results, agent_test_traits) ships now so slice 2's runner has somewhere to write; only Reviews + Tests endpoints and UI are wired in this slice.

AgentReviewsPage lists reviews with search, favorite / agent filters, and a "+ New" modal. AgentReviewPage renders a header + 6 tabs; the Tests tab is fully wired (table + drawer for prompt / reference / expectations). The other 5 tabs are explicit placeholders pointing to slice 2.

Slice 2 — Runner + LLM-as-judge + Results

agent_judge runs two judge prompts in parallel: a reference semantic-equivalence check and an expectations rule-check. Both run at temperature 0; a judge is skipped when its input is empty, and a test with no grading dimension resolves to neutral. combine_overall_status() rolls the verdicts up into pass / fail / error / neutral.
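To make the roll-up concrete, here is a minimal sketch of how two judge verdicts might combine. Only the function name and the four statuses come from these notes; the precedence (error over fail over pass, neutral when nothing was graded) and the use of None for a skipped judge are assumptions.

```python
from typing import Optional


def combine_overall_status(reference: Optional[str], expectations: Optional[str]) -> str:
    """Combine two judge verdicts; None means that judge was skipped.

    The precedence below is an assumed rule, not the shipped one.
    """
    graded = [v for v in (reference, expectations) if v is not None]
    if not graded:
        return "neutral"  # no grading dimension was available
    if "error" in graded:
        return "error"
    if "fail" in graded:
        return "fail"
    if all(v == "pass" for v in graded):
        return "pass"
    return "neutral"
```

Under this assumed ordering, a single failing dimension is enough to fail the test, and a judge error is never masked by a pass on the other dimension.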

agent_test_runner.run_all_tests fans out N executions per test, snapshots agent_version onto the run row, persists results + traits, and routes agent invocation through generate_knowledge_response so connector / tools / RAG mirror the live chat path. quick_test runs once with no persistence.

Five new endpoints on /api/agent-reviews/{rid}: GET / POST /runs, GET /runs/{id}, GET /runs/{id}/results (single batched traits query — no N+1), POST /tests/{tid}/quick-test. The page gains a Run All button, inline ⚡ Quick Test on the Tests tab, and a new Results tab with run selector, status breakdown bar, per-test Traits matrix, and result drawer.

Slice 3 — Compare / Settings / Logs

The last three placeholder tabs go live:

  • Compare — two-run selector joined on (test_id, execution_index), shows pass/fail per side + same / regression / improvement / changed chip.
  • Settings — name / description / judge_model / executions_per_test / tags via the existing PATCH endpoint.
  • Logs — "See trace" button in ResultDrawer opens TraceExplorerDrawer scoped to the test run.
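The Compare join can be sketched as a full outer join of two runs keyed on (test_id, execution_index). The chip names come from the list above; the mapping from status pairs to chips, and the helper names classify and compare_runs, are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[int, int]  # (test_id, execution_index)


def classify(left: Optional[str], right: Optional[str]) -> str:
    """Label one joined row; left is the baseline run, right the candidate."""
    if left == right:
        return "same"
    if left == "pass" and right == "fail":
        return "regression"
    if left == "fail" and right == "pass":
        return "improvement"
    return "changed"  # any other transition, including a row missing on one side


def compare_runs(left: Dict[Key, str], right: Dict[Key, str]) -> Dict[Key, str]:
    """Full outer join of two runs on (test_id, execution_index)."""
    keys = sorted(set(left) | set(right))
    return {k: classify(left.get(k), right.get(k)) for k in keys}
```

Joining on the pair rather than on test_id alone keeps repeated executions of the same test lined up one-to-one across the two runs.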

This slice also wires the trace plumbing: a new ContextVar in tool_executor (_test_run_ctx) binds test_run_id + execution_index into agent_tool_executions rows during a test run. Live chat is unchanged (both columns stay NULL when the variable is unbound).
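A minimal sketch of the ContextVar binding, using only the standard library. The variable name _test_run_ctx is from the notes; bind_test_run, build_execution_row, and the integer id types are hypothetical names and assumptions chosen for the example.

```python
import contextvars
from contextlib import contextmanager
from typing import Iterator, Optional, Tuple

# Holds (test_run_id, execution_index); None means "live chat, unbound".
_test_run_ctx: contextvars.ContextVar[Optional[Tuple[int, int]]] = contextvars.ContextVar(
    "_test_run_ctx", default=None
)


@contextmanager
def bind_test_run(test_run_id: int, execution_index: int) -> Iterator[None]:
    """Hypothetical helper: bind the ids for the duration of one test execution."""
    token = _test_run_ctx.set((test_run_id, execution_index))
    try:
        yield
    finally:
        _test_run_ctx.reset(token)  # restore the previous (usually unbound) value


def build_execution_row(tool_name: str) -> dict:
    """Hypothetical helper: the row a tool executor would persist."""
    bound = _test_run_ctx.get()
    test_run_id, execution_index = bound if bound is not None else (None, None)
    return {
        "tool_name": tool_name,
        "test_run_id": test_run_id,
        "execution_index": execution_index,
    }
```

Because each asyncio task runs in its own copy of the context, parallel test executions would each see their own binding without passing the ids through every call.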

Trace Explorer (#15)

Walks data already captured in agent_tool_executions and renders it as a tree. New fields land (idempotent ALTER, all nullable): parent_exec_id, span_kind, tokens_in/out, cost_usd, test_run_id, execution_index.
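Turning flat rows into a tree via parent_exec_id could look like the sketch below. The primary-key name id and the children key are assumptions; parent_exec_id and span_kind are the fields named above.

```python
from typing import Any, Dict, List


def build_span_tree(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Nest flat execution rows by parent_exec_id and return the root spans."""
    nodes = {r["id"]: {**r, "children": []} for r in rows}
    roots: List[Dict[str, Any]] = []
    for r in rows:
        node = nodes[r["id"]]
        parent = nodes.get(r.get("parent_exec_id"))
        if parent is None:
            roots.append(node)  # no parent, or the parent is outside this result set
        else:
            parent["children"].append(node)
    return roots
```

Rows captured before the new columns existed have a NULL parent_exec_id, so in this sketch they simply surface as roots and the tree degrades to a flat list.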

/api/traces (router prefix) exposes:

  • GET /sessions/{id} — span tree (default) or flat
  • GET /test-runs/{run_id}/results/{result_id} — same, scoped to one test execution
  • GET /executions/{exec_id} — full payload (no truncation)

TraceExplorerDrawer slides in with kind chips, status dots, latency, expandable args/result blocks. Click-to-expand fetches the full payload. ConversationsPage gets a See trace button in the thread header.

Agent versioning (#16)

published_assets gains an active_version pointer separate from current_version (latest). Agents can be edited (which keeps bumping current_version) without changing what consumers serve. NULL preserves today's behavior; setting N pins serving to that historical asset_versions row.
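The pointer semantics reduce to a small rule, sketched here with a hypothetical resolve_served_version helper; the column names are from the paragraph above.

```python
from typing import Optional


def resolve_served_version(current_version: int, active_version: Optional[int]) -> int:
    """Pick the version consumers are served.

    NULL (None) keeps the existing always-latest behavior; an integer pins serving.
    """
    return current_version if active_version is None else active_version
```

For example, an agent at current_version 5 with active_version 3 keeps serving version 3 while edits continue to bump current_version.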

New AgentVersionPicker dropdown in the AgentBuilderPage header replaces the static "Version vN" label. Foundation for Agent Review pinning runs to a version and for Agent Hub (Batch G #9) serving the active version rather than always-latest.

Bidirectional Chat ↔ Cobuild

Closes the loop between a dashboard's Build chat and the Cobuild dock — operators can now jump in both directions without losing context.

  • GET /api/cobuild/by-asset/{type}/{id} — most recent Cobuild session whose workstream produced this asset. Backs the Spawned by Cobuild pill in the Dashboard Build chat header.
  • POST /api/cobuild/escalate-from-dashboard/{dashboard_id} — spins a new Cobuild session seeded from the dashboard's chat transcript. Optional note body field appended verbatim; context_turns clamped to [1, 20].
  • GET /api/chats/related/{type}/{id} — escalation tree for an asset (primary chat + ancestor Cobuilds + descendant Cobuilds) in one round trip.
  • GET /api/chats/recent[?scope=] — unified view across dashboard_chat_messages, cobuild_messages, and chat_history. Paves the way for a future "option C" unified-history surface.

Frontend:

  • SpawnedByCobuildPill extracted as a shared component (block + inline variants) — dashboards now, agents and webapps next.
  • Cobuild dock rows carry a from <asset_type> chip when source_asset_type is set. The chip is clickable and navigates to the source asset; sourceAssetHref covers dashboard / recipe / agent.
  • New RelatedChatsPanel drops into a builder page's right rail and renders the asset's full escalation tree. Wired into DashboardDetailPage below the existing AssetReferencesPanel.
  • Recipe pages get the SpawnedByCobuildPill (inline variant) next to the recipe title.

Cross-org gating returns 404 (not 403) on /by-asset/ and /related/ so callers can't probe the asset-id space across tenants.

Dashboards — cross-filter cascade

Backend foundation for click-as-filter on dashboard pages. ExecuteAllRequest now accepts cross_filters: list[{column, value, values}] and an optional cross_filter_source_card_id (excluded from the cascade so the originating card always shows its full slice).

Per-card opt-in via card_config.cross_filter_columns. Cards without the opt-in are untouched. Identifier safety: column names must match _PARAM_NAME_RE fullmatch — any unvetted identifier silently skips the wrap rather than risking injection. IN-list selections supported for multi-pick bar/pie clicks. Frontend wiring is a follow-up.
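The cascade can be pictured as wrapping each opted-in card's query in an outer SELECT with parameterized filters. This is a sketch under several assumptions: the regex shown for _PARAM_NAME_RE, the qmark placeholder style, the _cf alias, and the function name are all illustrative, and this version skips an unvetted filter individually rather than abandoning the whole wrap.

```python
import re
from typing import Any, Dict, List, Tuple

# Assumed pattern: the notes name _PARAM_NAME_RE but do not show it.
_PARAM_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def wrap_with_cross_filters(
    sql: str, cross_filters: List[Dict[str, Any]], allowed_columns: List[str]
) -> Tuple[str, List[Any]]:
    """Wrap a card query in an outer SELECT; values travel as bound parameters."""
    clauses: List[str] = []
    params: List[Any] = []
    for f in cross_filters:
        col = f.get("column", "")
        # The card must opt in to the column AND the name must be a plain identifier.
        if col not in allowed_columns or not _PARAM_NAME_RE.fullmatch(col):
            continue  # unvetted identifier: skip rather than interpolate it
        values = f.get("values") or [f.get("value")]
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f'"{col}" IN ({placeholders})')
        params.extend(values)
    if not clauses:
        return sql, []
    return f"SELECT * FROM ({sql}) AS _cf WHERE " + " AND ".join(clauses), params
```

Only the column name is ever interpolated into the SQL text, and only after the fullmatch check; the selected values never touch the query string.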

Wire-filter apply — surface real errors

When _snapshot_dashboard raised mid-execute (e.g. a partial SQL statement already ran before the failure), the AsyncSession was left in a pending-rollback state. The caller's try/except: pass swallowed the exception but did not roll back, so subsequent UPDATEs raised a generic InvalidRequestError, which the frontend collapsed to "Failed to apply". The backend now rolls back after the snapshot except block in both wire-filter apply and apply-suggestions, and a new extractApplyErrorMessage() surfaces HTTP status, structured 4xx details, and network-level err.message so the next prod failure is debuggable from the modal alone.
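The failure mode and the fix can be reproduced with a stand-in session. FakeSession below is a toy written for this example, not SQLAlchemy's AsyncSession; it only mimics the "failed statement poisons the session until rollback" behavior described above.

```python
import asyncio


class FakeSession:
    """Toy async session: a failed statement blocks further work until rollback."""

    def __init__(self) -> None:
        self.needs_rollback = False

    async def execute(self, stmt: str) -> str:
        if self.needs_rollback:
            raise RuntimeError("session is in a pending-rollback state")
        if stmt == "BAD":
            self.needs_rollback = True
            raise ValueError("statement failed")
        return "ok"

    async def rollback(self) -> None:
        self.needs_rollback = False


async def apply_with_snapshot(session: FakeSession) -> str:
    try:
        await session.execute("BAD")  # stands in for the snapshot step failing
    except Exception:
        await session.rollback()  # the fix: clear the failed transaction
    return await session.execute("UPDATE")  # later writes now succeed


result = asyncio.run(apply_with_snapshot(FakeSession()))
```

Removing the rollback call in this toy makes the final execute raise, which mirrors the generic error the modal used to report.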

LDAP sync (RBAC Batch 1)

Per-org LDAP configuration + group-membership sync closes Phase 4 of the RBAC roadmap. groups.ldap_dn has been live since v0.0.55 but had neither a UI nor a sync engine.

New table honeyframe.org_ldap_config (one row per org). Bind password is AES-GCM encrypted using the same key derivation as services.llm_credentials. Sync semantics in services/ldap_sync.py: for each group with ldap_dn IS NOT NULL, query LDAP for members, resolve each via user_id_attribute (default mail) → match against honeyframe.users.email scoped to the caller's org. The sync then reconciles user_groups: insert new memberships, delete removed ones. Local groups (no ldap_dn) are never touched. Users present in LDAP but not in honeyframe are silently skipped; provisioning is out of scope.
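The reconcile step for a single group amounts to two set differences. In this sketch reconcile_group and its arguments are hypothetical, and the case-insensitive email match (with users_by_email keyed by lowercased email) is an assumption the notes do not confirm.

```python
from typing import Dict, Iterable, Set, Tuple


def reconcile_group(
    ldap_member_ids: Iterable[str],
    users_by_email: Dict[str, int],
    current_user_ids: Set[int],
) -> Tuple[Set[int], Set[int]]:
    """Return (to_insert, to_delete) user ids for one LDAP-backed group."""
    desired = {
        users_by_email[m.lower()]
        for m in ldap_member_ids
        if m.lower() in users_by_email  # LDAP users unknown to the org are skipped
    }
    return desired - current_user_ids, current_user_ids - desired
```

Running it twice with the same LDAP membership yields two empty sets on the second pass, so the sync is idempotent.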

Endpoints under /api/admin/ldap-config (org.admin only): GET, PUT, DELETE, POST /test (bind + small search, supports inline config for try-before-save), POST /sync. Frontend Settings page is a follow-up.

AI dataset descriptions

POST /api/datasets/ai-describe sends a dataset's schema + a small sample to the org's configured LLM and returns strict-JSON with a one-line table description plus per-column descriptions. apply=true writes them into honeyframe.datasets in one transaction; default is preview-only so the UI can show a review-then-apply flow (matches Dataiku's "Generate description" UX). Closes Databricks' "AI-generated table comments" parity gap.
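A strict-JSON response only helps if it is validated before anything is written. The sketch below shows one way to do that; the payload keys table_description and columns, and the function name, are hypothetical since the notes do not give the response schema.

```python
import json
from typing import Dict, List, Tuple


def parse_description_payload(raw: str, columns: List[str]) -> Tuple[str, Dict[str, str]]:
    """Validate the LLM reply and keep descriptions only for real columns."""
    data = json.loads(raw)  # raises on anything that is not valid JSON
    table_desc = data["table_description"]  # hypothetical key
    col_descs = data["columns"]  # hypothetical key
    if not isinstance(table_desc, str) or not isinstance(col_descs, dict):
        raise ValueError("unexpected payload shape")
    # Drop any column the model invented that is not in the dataset schema.
    kept = {c: str(col_descs[c]) for c in columns if c in col_descs}
    return table_desc.strip(), kept
```

In a preview-then-apply flow, this validated result is what the UI would show for review before apply=true commits it.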

Scenarios — python trigger

scenarios.trigger_type='python' lands. The scheduler evaluates a user-supplied expression every tick inside a strict AST sandbox; truthy result fires the scenario subject to the same 60s throttle as cron. Whitelisted builtins only — no imports, lambdas, or dunders. db_query(sql) accepts SELECT-only single statements; 10s wall-clock budget per evaluation. Closes the python-trigger gap in the competitive matrix.
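The AST checks can be sketched as follows. This is an illustration of the idea only, not a complete sandbox: the builtin whitelist, the evaluate_trigger name, and the context argument are assumptions, and db_query and the 10s wall-clock budget are omitted.

```python
import ast
from typing import Any, Dict

# Assumed whitelist; the notes do not enumerate the allowed builtins.
_SAFE_BUILTINS: Dict[str, Any] = {"len": len, "min": min, "max": max, "sum": sum, "abs": abs}


def evaluate_trigger(expr: str, context: Dict[str, Any]) -> bool:
    """Evaluate a trigger expression after rejecting lambdas and dunder names."""
    # mode="eval" accepts a single expression, so `import x` is a SyntaxError.
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Lambda):
            raise ValueError("lambdas are not allowed")
        # ast.Name carries `id`, ast.Attribute carries `attr`.
        name = getattr(node, "id", None) or getattr(node, "attr", None)
        if isinstance(name, str) and name.startswith("__"):
            raise ValueError("dunder names are not allowed")
    scope = {"__builtins__": {}, **_SAFE_BUILTINS, **context}
    return bool(eval(compile(tree, "<trigger>", "eval"), scope))
```

A trigger such as `len(rows) > 2` evaluates against whatever the scheduler places in context, and a truthy result would then be subject to the same throttle as cron.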