
Flow

The Flow is the canvas where a project's datasets, recipes, and AI agents are arranged into a dependency graph. Most operator workflows start there: pick the dataset you want, see what feeds it, see what depends on it, and edit the recipe that produces it without leaving the canvas.

Conceptually it sits in the same family as Dataiku's Flow or Airflow's DAG view. Operationally it's a React Flow canvas backed by paas/backend/routers/flow_codegen.py (visual → dbt SQL) and paas/backend/routers/flow_ai.py (AI nodes).

[Figure: Empty Flow canvas]

After authoring recipes, the canvas auto-builds the DAG — every recipe block links to its upstream sources and downstream models. Nodes are colored by status (green = success, red = failed, amber = stale):

[Figure: Flow with recipe and downstream model]

Node types

Node | Visual | Meaning
Dataset | Square | A queryable dataset: source, intermediate, or mart.
Visual recipe | Round | Block-based recipe (prepare, filter, join, group_by, aggregate, window, formula, stack, and the newer pivot, top_n, sample, distinct, fuzzy_join, generate_statistics). Compiled to SQL.
SQL recipe | Round | Free-form SQL model.
SQL Script recipe | Round | Multi-statement SQL script; the author owns the DDL/DML.
Export recipe | Round | Terminal sink that writes a dataset to a file (csv/parquet/xlsx) in a managed folder.
Python recipe | Round | Standalone Python, opaque to dbt.
Notebook recipe | Round | Jupyter-style notebook recipe.
Sync recipe | Round | Standalone ingestion node that pulls from a connector and writes to a dataset.
AI recipe | Round | agent, embed, classify_text, summarize_text, parse_documents, knowledge-base feed.
dbt recipe | Round (teal, "dbt" badge) | Synthesized read-only node for a hand-written dbt model (see dbt models on the canvas).
Folder | Square (Infra group) | A managed folder as a first-class flow input (see Folders as flow inputs).
Zone | Group container | Color-coded grouping of nodes. Does not affect lineage or permissions; can be built as a unit (see Zone build).

Edges are drawn input → recipe → output. A recipe with multiple inputs (e.g. a join) has one edge per input.

Building a Flow

There are three ways nodes appear on the canvas:

  1. Add a connector and create a dataset. Each new dataset shows up as a leaf node with no upstream.
  2. Add a recipe via the + New Recipe button on a dataset, or by dragging a block from the Add Item palette. The recipe and its output dataset both appear.
  3. Drag-and-drop a standalone node — useful for Python, Notebook, Sync, and AI recipes that aren't tied to a single upstream.

Drag a node onto a Zone to assign it. Zones group nodes for organization and can also be built as a unit (see Zone build); they have no effect on lineage or permissions.

The visual recipe builder

Visual recipes are the heart of the Flow. Click + New Recipe → Visual on any dataset and the right-side panel opens with the block palette. Each block translates to a piece of dbt SQL, which the platform compiles and writes to models/<layer>/<recipe_name>.sql on save.

Block types and what they produce:

Block | Generated SQL
prepare | SELECT col1, col2, CAST(col3 AS NUMERIC) AS col3, … (column rename, cast, drop, derive).
filter | … WHERE col > value AND col2 IS NOT NULL (chained predicates with AND/OR conjunctions).
join | LEFT/INNER/RIGHT/FULL/CROSS JOIN other ON … (multi-key joins, prefix-based collision handling).
group_by | GROUP BY col1, col2 paired with the next aggregate block.
aggregate | SUM/COUNT/AVG/MIN/MAX/STDDEV/VARIANCE(col) AS alias.
window | ROW_NUMBER/RANK/DENSE_RANK/SUM/AVG/COUNT/MIN/MAX OVER (PARTITION BY … ORDER BY …).
formula | Arbitrary scalar SQL expression as a derived column.
stack | UNION ALL / UNION of two model refs.
split | One input → N outputs by predicate (v0.0.38). Each branch becomes a separate downstream model with its own filter clause. Use to fan out a cleaned dataset into per-segment marts without writing N near-identical recipes.
pivot | Long → wide. Portable conditional aggregation: one SUM/COUNT(CASE WHEN spread_col = 'val' THEN measure END) column per enumerated value (Postgres has no native PIVOT). Spread values can be typed manually or auto-discovered with Fetch from data, which runs SELECT DISTINCT … LIMIT 100 against the governed preview endpoint.
top_n | Per-group top-K. ROW_NUMBER() CTE partitioned by the group, ordered by the measure, filtered to rank_num <= N; global when no group is set.
sample | Take a row subset for fast development on large tables. first mode is LIMIT n (cheap); random mode is ORDER BY RANDOM() LIMIT n (uniform, but sorts the whole input; warns on large tables).
distinct | Deduplicate rows. With no columns: whole-row SELECT DISTINCT. With columns: SELECT DISTINCT ON (cols) keeping one row per combination, ordered by the keys plus an optional tiebreaker (e.g. updated_at DESC) so the surviving row is deterministic; warns when no tiebreaker is set.
fuzzy_join | Join on approximate equality. Three methods: trigram (similarity() >= threshold via pg_trgm, default 0.7), levenshtein (<= distance via fuzzystrmatch), and exact_normalized (LOWER(TRIM(…)) =, no extension). Refuses a keyless predicate (cartesian guard) and always warns about row-explosion cost, so run it on filtered/blocked inputs.
generate_statistics | Column profiling. Emits one row per column: column_name, row_count, non_null_count, null_count, fill_rate, distinct_count, min_value, max_value. Type-agnostic (min/max cast to ::text), so a single UNION ALL works across numeric/text/date columns.
sql | Pass-through, for steps the visual builder can't express.
sync | Read from a connector → write to a target dataset. Not part of the dbt build; runs as a Python step.
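
To make the pivot row concrete, here is a minimal sketch of how conditional aggregation can be generated from a list of spread values. It is not the platform's codegen: the function name pivot_sql, its parameters, and the column alias scheme are assumptions made for illustration, and identifiers are assumed to be trusted (only the spread values are escaped as literals).

```python
def pivot_sql(source: str, group_cols: list[str], spread_col: str,
              measure: str, values: list[str], agg: str = "SUM") -> str:
    """Build a long-to-wide pivot as one aggregate-over-CASE column per value.

    Hypothetical helper: names and alias scheme are illustrative only.
    """
    if agg not in {"SUM", "COUNT"}:
        raise ValueError(f"unsupported aggregate: {agg}")
    if not values:
        raise ValueError("at least one spread value is required")

    def quote_literal(value: str) -> str:
        # Standard SQL string literal: double any embedded single quote.
        return "'" + value.replace("'", "''") + "'"

    def alias(value: str) -> str:
        # Turn the spread value into a safe column suffix.
        cleaned = "".join(c if c.isalnum() else "_" for c in value.lower())
        return f"{measure}_{cleaned}"

    pivot_cols = [
        f"{agg}(CASE WHEN {spread_col} = {quote_literal(v)} "
        f"THEN {measure} END) AS {alias(v)}"
        for v in values
    ]
    select_list = ",\n    ".join(group_cols + pivot_cols)
    sql = f"SELECT\n    {select_list}\nFROM {source}"
    if group_cols:
        sql += "\nGROUP BY " + ", ".join(group_cols)
    return sql


print(pivot_sql("orders", ["customer_id"], "status", "amount", ["paid", "refunded"]))
```

Because the output is plain CASE WHEN aggregation, the same statement runs on engines without a native PIVOT, which is the portability argument the table makes.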

Refs between visual blocks resolve to dbt's {{ ref('upstream_model') }}; refs to external sources resolve to {{ source('schema', 'table') }}. The canvas always has the live, resolved SQL one tab away (Preview SQL), so you can copy-paste into psql to debug.

The codegen layer refuses unsafe SQL by construction: a join (or fuzzy join) with no key pair is rejected at compile time rather than emitting a silent cartesian product.
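
The cartesian guard can be pictured as a check that runs before any SQL text is produced. The sketch below is an assumption about the shape of such a guard, not the code in flow_codegen.py; CodegenError and join_sql are hypothetical names, and explicit CROSS joins are left out to keep the example small.

```python
class CodegenError(ValueError):
    """Raised when a block config would compile to unsafe SQL (hypothetical)."""


def join_sql(left: str, right: str, keys: list[tuple[str, str]],
             how: str = "LEFT") -> str:
    """Compile a keyed join, refusing to emit SQL when no key pair is given."""
    how = how.upper()
    if how not in {"LEFT", "INNER", "RIGHT", "FULL"}:
        raise CodegenError(f"unsupported join type: {how}")
    if not keys:
        # Fail at compile time instead of emitting a silent cartesian product.
        raise CodegenError("join requires at least one key pair")
    on_clause = " AND ".join(
        f"{left}.{left_col} = {right}.{right_col}" for left_col, right_col in keys
    )
    return f"SELECT * FROM {left} {how} JOIN {right} ON {on_clause}"


print(join_sql("orders", "customers", [("customer_id", "id")]))
```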

POST /api/flow/preview-sql runs the generated SQL against the connector and returns a 100-row sample without persisting the recipe — useful for iterating before you save.

Reverse-engineering hand-written SQL

POST /api/flow/parse-to-visual attempts to reverse a dbt SQL model into block configs. It's best-effort: simple SELECT … FROM ref(...) WHERE … JOIN … patterns parse cleanly; complex CTEs, window functions inside qualify, and pivot patterns fall back to a single sql block.

Use it to bring legacy dbt projects under visual editing. Don't expect a round-trip to be byte-identical — saving the parsed blocks regenerates SQL in the platform's canonical style.

Build engines: dbt and native

Every org runs on one of two engine profiles, set at org creation (engine_profile: dbt | native) and visible to the SDK via GET /api/catalog/engine:

  • dbt (default) — visual and SQL recipes compile to dbt models; builds run through dbt run.
  • native — a dbt-less engine that runs the same recipes directly. A native org skips the dbt pip-install, profiles.yml, and the dbt-run timer entirely. Visual/SQL recipes are stored with target='sql' and Jinja-free inline_code, and the native engine materializes them honoring the author's view/table choice and target_schema.

The same recipes, lineage, and canvas work on both engines — most of this page is engine-agnostic. The SDK exposes Project.engine() / Project.uses_dbt() so automation can detect which it's talking to.

Native build (the native "dbt run")

POST /api/flow/build-native (project.edit, native orgs only) is the native equivalent of dbt run. It topologically sorts the project's target='sql' recipes by dataset I/O, then runs each in dependency order, stopping on the first failure. The plan resolver is pure, stable, and cycle-detecting (a cycle returns 400).
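
As an illustration of what a pure, stable, cycle-detecting plan resolver can look like, the sketch below orders recipes by dataset I/O with Kahn's algorithm and a heap for deterministic tie-breaking. The input shape (a dict of recipe name to inputs/outputs), the names build_order and CycleError, and the alphabetical tie-break are all assumptions; the platform's resolver may differ.

```python
import heapq


class CycleError(ValueError):
    """Raised when recipes depend on each other in a loop (hypothetical name)."""


def build_order(recipes: dict[str, dict[str, list[str]]]) -> list[str]:
    """Return recipe names in dependency order, derived from dataset I/O."""
    # Map each dataset to the recipe that produces it.
    producer = {}
    for name, io in recipes.items():
        for dataset in io["outputs"]:
            producer[dataset] = name

    downstream = {name: set() for name in recipes}
    indegree = {name: 0 for name in recipes}
    for name, io in recipes.items():
        for dataset in io["inputs"]:
            upstream = producer.get(dataset)  # None means an external source
            if upstream is not None and name not in downstream[upstream]:
                downstream[upstream].add(name)
                indegree[name] += 1

    # A min-heap keeps the order stable: ties resolve alphabetically.
    ready = [name for name, degree in indegree.items() if degree == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        current = heapq.heappop(ready)
        ordered.append(current)
        for nxt in downstream[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(ordered) != len(recipes):
        stuck = sorted(set(recipes) - set(ordered))
        raise CycleError(f"cycle detected among: {', '.join(stuck)}")
    return ordered
```

The function touches no database and has no side effects, so the same call can serve both a real build and a preview of the plan.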

Three query options mirror dbt's most-used flags:

  • ?select= — partial rebuild of a model's subgraph using dbt selector syntax: model, model+ (and downstream), +model (and upstream), +model+ (full lineage), comma-separated for several. An unknown model is a 400 (a typo is surfaced loudly, not silently built as nothing).
  • ?background=true — resolves the plan synchronously (engine/selector/cycle errors still fail fast), records a native_build job run, then runs the ordered recipes on a fresh session and returns {run_id, status: "running"} immediately. Poll GET /api/jobs/{run_id} for progress — no new polling endpoint.
  • ?dry_run=true — resolves the plan (engine gate, selector, topo order) and returns {ordered, total, selected, dry_run: true} without touching the warehouse — the native dbt ls --select. Pairs with ?select= to preview exactly which recipes a partial rebuild would run, and in what order, before committing.
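
The selector forms listed for ?select= can be sketched as a small graph walk. This is an illustrative reading of the four documented forms only (model, model+, +model, +model+); resolve_select, UnknownModelError, and the parents mapping are hypothetical, and other dbt selector features are not covered.

```python
class UnknownModelError(ValueError):
    """Raised for a selector naming a model that does not exist (hypothetical)."""


def resolve_select(select: str, parents: dict[str, set[str]]) -> set[str]:
    """Expand a comma-separated selector into the set of selected models.

    `parents` maps each model to the models it reads from directly.
    """
    children = {model: set() for model in parents}
    for model, ups in parents.items():
        for up in ups:
            children.setdefault(up, set()).add(model)
    known = set(parents) | set(children)

    def walk(start: str, graph: dict[str, set[str]]) -> set[str]:
        # Collect everything reachable from `start` along `graph` edges.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for nxt in graph.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    selected = set()
    for token in (part.strip() for part in select.split(",")):
        if not token:
            continue
        with_upstream = token.startswith("+")
        with_downstream = token.endswith("+")
        name = token.strip("+")
        if name not in known:
            # Surface typos loudly instead of building nothing.
            raise UnknownModelError(name)
        selected.add(name)
        if with_upstream:
            selected |= walk(name, parents)
        if with_downstream:
            selected |= walk(name, children)
    return selected
```

Intersecting this set with a dependency-ordered recipe list would give the kind of ordered subset a dry run reports.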

GET /api/flow/recipes/native-readiness returns a ready/blocked verdict (with per-recipe buckets) for converting an org to native, and POST /api/flow/recipes/convert-to-sql bulk-converts existing recipes. Flipping an org to native (PUT /api/catalog/engine) returns 409 if any SQL-shaped recipe isn't native-runnable, unless ?force is passed; flipping back to dbt is unguarded.

Lenses ("Apply a view")

A lens recolors the whole canvas to surface a single dimension, selected from the Apply a view dropdown. Lenses are presentation-only — they never change the graph or trigger a build:

  • Cost lens — tints each node by total compute time (total_duration_sec from /jobs/summary) in a 5-stop blue→green→amber→orange→red heatmap. Thresholds are relative to the per-DAG max, so a fast pipeline doesn't paint everything red. The lens extends to recipe blocks (heat where compute actually lives), and the legend pill shows the DAG total and the slowest node. A 14-day cost-trend drawer and a top-3 "hot" glow highlight the most expensive nodes.
  • Tags lens — colors datasets by their first dbt tag.
  • Recipe Engines lens — colors recipes by the engine that runs them.
  • Schema lens — colors nodes by their output schema.
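
The Cost lens's relative thresholds can be illustrated with a small bucketing function. The five color names come from the description above; the equal-width buckets, the function name cost_colors, and the handling of an all-zero DAG are assumptions, since the actual threshold values are not documented here.

```python
STOPS = ["blue", "green", "amber", "orange", "red"]


def cost_colors(durations: dict[str, float]) -> dict[str, str]:
    """Map each node's total_duration_sec to a heatmap stop, relative to the DAG max.

    Assumes five equal-width buckets between 0 and the slowest node.
    """
    peak = max(durations.values(), default=0.0)
    colors = {}
    for node, seconds in durations.items():
        if peak <= 0:
            # Nothing has measurable cost yet: keep the coolest color.
            colors[node] = STOPS[0]
            continue
        ratio = seconds / peak
        index = min(int(ratio * len(STOPS)), len(STOPS) - 1)
        colors[node] = STOPS[index]
    return colors


print(cost_colors({"stg_orders": 1.0, "stg_customers": 50.0, "mart_sales": 100.0}))
```

Scaling against the per-DAG maximum means colors describe where the cost sits within this pipeline, not how it compares with an absolute budget.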

Running the Flow

Execution surfaces:

  • Single-node run — Right-click a recipe → Run. Runs only that recipe.
  • Subgraph run — Right-click a dataset → Build → choose upstream, downstream, or full lineage. Runs all the recipes needed to bring that dataset up-to-date.
  • Zone build — Right-click a zone → ▶ Build, or pick several zones for a single multi-zone build (one dbt run across all of them). See Zones.
  • Multi-select bulk run — Select several nodes and use the bulk-run bar to run them together.
  • Scheduled run — On the Schedules page, attach a cron expression to a build target. The scheduler picks it up.

Run progress streams to the canvas as colored borders on the running nodes (in-progress / queued-pulse / success / failure), with a live build-counter chip and a flow-level health chip. dbt builds stream per-model progress over SSE, so each node flips status as its model finishes rather than all at once. Click any node mid-run to see the live log. Each node persists its last-run latency and a 14-day duration sparkline.
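
A client consuming per-model progress over SSE needs to split the stream into events and apply each one to the matching node. The sketch below parses the standard SSE framing (data: lines terminated by a blank line, comment lines starting with a colon); the JSON payload with model and status fields is a hypothetical shape, since the actual event schema is not described on this page.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete server-sent event."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if field == "data":
            buffer.append(value[1:] if value.startswith(" ") else value)


def apply_progress(lines: Iterable[str], node_status: dict[str, str]) -> dict[str, str]:
    """Update node statuses from a stream of hypothetical {model, status} events."""
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        node_status[event["model"]] = event["status"]
    return node_status
```

Because each event is applied as it arrives, a node can change status as soon as its model finishes, without waiting for the whole build.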

Zones

Zones are color-coded rectangles that bundle related nodes. They group nodes for organization and as a build unit (see Zone build below); they have no effect on lineage or permissions. The platform ships with ten zone colors (orange, blue, purple, green, brown, red, amber, pink, indigo, gray) and lets you create custom zone templates.

Common patterns:

  • Bronze / Silver / Gold zones — three zones aligned with the Lakehouse data quality tiers. Raw ingest into Bronze, cleaned into Silver, modeled into Gold.
  • One vertical per Zone — for project-of-projects layouts, a Zone per business domain.
  • Sandbox zone — experimental nodes you don't want in the main lineage. Move them out when promoted.

Two assignment paths: drag-and-drop a node onto a zone, or use Auto-assign (POST /api/flow/zones/auto-assign) which groups nodes by layer and proximity.

The canvas layout (node positions) persists per-project via POST /api/flow/zone-layout. Without an explicit layout, nodes are auto-positioned at first render. Zones can be renamed and recolored inline on hover, and a per-zone header shows a health rollup.

Zone build

Unlike the original "purely organizational" model, a zone can now be built as a unit: right-click a zone → ▶ Build runs every recipe it owns. Pick several zones to run a single multi-zone build (one dbt run spanning all the selected zones). Each zone has a build-history drawer with a duration sparkline and an avg/success summary.

Zone shares (virtual references)

A node owned by zone A can be shared into zone B as a virtual reference — a "ghost" rendering — without moving the producing recipe or its storage. This cuts cross-zone arrow spaghetti in dense flows. The source zone retains build ownership: share-targets do not participate in build-zone selector resolution. Use Share to zone from the node's context menu.

Endpoint | Description
GET /api/flow/zones/{id}/shares | Node IDs shared into this zone.
GET /api/flow/zones/shares | { node_id: [target_zone_id, …] } for the whole project, one fetch.
POST /api/flow/zones/{id}/shares | Share nodes into a zone (silently skips the owning zone).
DELETE /api/flow/zones/{id}/shares/{node_id} | Unshare.
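
The project-wide share map is keyed by node, while a canvas usually needs the opposite view: which ghost nodes to draw in each zone. A minimal sketch of that inversion follows; the function name shares_by_zone and the example IDs are hypothetical, and only the { node_id: [target_zone_id, …] } shape is taken from the table above.

```python
from collections import defaultdict


def shares_by_zone(share_map: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert { node_id: [target_zone_id, ...] } into { zone_id: [node_id, ...] }."""
    by_zone = defaultdict(list)
    for node_id, zone_ids in share_map.items():
        for zone_id in zone_ids:
            by_zone[zone_id].append(node_id)
    # Sort for a stable rendering order.
    return {zone: sorted(nodes) for zone, nodes in sorted(by_zone.items())}


# Example IDs are made up for illustration.
print(shares_by_zone({"node-1": ["zone-b", "zone-c"], "node-2": ["zone-b"]}))
```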

Canvas UX

The canvas has had a sustained overhaul across recent releases:

  • Recipe picker — categorized and searchable, with a Recent group and keyboard navigation. Right-click an empty area for a contextual create menu.
  • Layout — autosave node positions, a minimap with a status legend, live edge animation, a first-time tour, focus mode (with a keybind), and collapse-all-zones (Z).
  • At-a-glance node health — status colors, health chips, cost strips, and 14-day sparklines surface in hover popovers. Edges carry hover tooltips.
  • Editing affordances — handle hover affordances and drag feedback, an amber connection-draw line, queued-node pulse, and a "no input" warning on recipes wired with no upstream. Family colors group node types.
  • Search — a genie-style search jumps to nodes by name with a match counter.
  • Guardrails — saving a recipe warns when a materialized='view' model has incremental consumers; the output layer is auto-picked and tucked behind an Advanced toggle.

AI recipes

AI nodes appear on the same canvas as data recipes:

  • Agent recipe — references a configured agent (Agent Builder). Run it to invoke the agent against a dataset row stream.
  • Embed recipe — produces a knowledge-base index from a dataset's text column. Persists to the configured vector store (Connectors → Vector stores).
  • Knowledge base — appears as a node so you can see which agents and dashboards consume it.
  • LLM connector (v0.0.38) — chat-SQL on the canvas. Drop the LLM connector node, point it at a dataset, and ask in natural language ("which customers churned in Q3?") — the connector composes SQL against the dataset's schema and returns a result the downstream nodes can consume. Tenant-aware: schema and sample values stay within the calling org.

AI nodes share the run / schedule / log surface with data recipes; mixing them in the same subgraph build is supported.

Run button on bare-model dbt recipes (v0.0.38)

dbt recipes that only define a SELECT (no upstream visual blocks) used to require a full project build to refresh. The Run button on the recipe sidebar now invokes that single model directly via dbt run --select <model>, returning fresh output without rebuilding the rest of the project.

Document-AI recipes

Three recipes handle free text and scanned documents:

  • Classify Text — maps a text column to a category from a fixed label set (constrained prompt, temperature 0).
  • Summarize Text — produces a short summary of a text column, sized to a word budget.
  • Parse Documents (parse_documents) — transcribes a folder of PDFs/scans and optionally runs a free-form LLM key-value parse, materializing an auto-created {file, extracted_text, parsed_json} table. It consumes a folder input via its folder picker.

Classify and summarize are thin wrappers over the existing LLM-enrich primitive, so they inherit its model routing (openai/anthropic/connector), batching, and cost/trace plumbing.

Cobuild on the canvas

When Cobuild (the AI assistant) executes a plan, the canvas updates live: new nodes appear as the plan runs, freshly-added nodes flash, and a tool-call flash plays on the affected nodes. Cobuild is selection-aware — it acts on the nodes you have selected. From a node's right-click menu you can Ask Cobuild about it directly.

Folders as flow inputs

A managed folder is a first-class flow object, not just a recipe attachment. The lineage DAG emits every project managed folder as a folder:{id} node. Drop a Folder block from the Infra palette group to create a new managed folder (or jump to an existing one); it lands on the canvas on the next DAG refresh. Folders feed recipes such as parse_documents and export (the Export recipe's destination).

Project variables

Project-scoped, typed variables (standard scope plus an optional local override) are referenced as :name / ${name} in SQL recipes, SQL Script recipes, and scenario run_sql steps. Values bind as parameters, never string-interpolated, so they are injection-safe. Manage them under Project Settings → Variables, or via the SDK (project.list_variables / set_variable / delete_variable).
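
The difference between binding a value as a parameter and interpolating it into the SQL string can be shown with Python's built-in sqlite3 module, which also accepts :name placeholders. This demonstrates the general technique only; it is not the platform's variable binder, and the table, the region variable, and run_with_variables are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "emea"), (2, "apac")])


def run_with_variables(conn: sqlite3.Connection, sql: str, variables: dict) -> list:
    """Execute SQL with :name placeholders bound as parameters."""
    return conn.execute(sql, variables).fetchall()


# A hostile value that tries to widen the WHERE clause.
hostile = "emea' OR '1'='1"

# Bound as a parameter, the value is compared as one literal string: no match.
bound = run_with_variables(
    conn, "SELECT id FROM orders WHERE region = :region ORDER BY id", {"region": hostile}
)

# Interpolated into the SQL text, the same value rewrites the predicate.
interpolated = conn.execute(
    f"SELECT id FROM orders WHERE region = '{hostile}' ORDER BY id"
).fetchall()

print(bound)         # []
print(interpolated)  # [(1,), (2,)]
```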

dbt models on the canvas

Hand-written dbt models used to render as bare dataset → dataset lines, because the transform lives inside the model SQL with no flow_recipes row. Two features now surface them in the Dataiku-style dataset → recipe → dataset shape — both read-only and presentation-only (nothing is written; dbt stays the source of truth):

  • Synthesized dbt recipe nodes — for every model with upstream deps and no real recipe, a synthetic sql recipe node is rendered between its inputs and the model (teal circle, dbt badge). Clicking it opens the model's SQL detail, never the editable recipe sidebar (there is no DB row to edit or delete).
  • SQL peek + edge chips — transforming models carry an always-on </> pill; hover for a read-only SQL preview popover, click to pin it, Open in editor → hands off to the full sidebar. Direct dataset → model edges show an ƒx chip on hover that opens the target model's SQL. SQL is fetched lazily from /lineage/model and cached per node.

Lineage and impact analysis

The Flow shows direct edges. For deeper analysis:

  • Lineage Explorer (Operations → Lineage) shows the full upstream/downstream subgraph for any node, with column-level granularity.
  • Column-level chain drill-down (v0.1.7) — in the column lineage graph, click a column row to open a drawer that renders the full source → staging → mart chain for that one column as a layered step list (layer label + lineage-type chip per step). Clicking a step re-centers the graph. Backed by GET /api/column-lineage/{model}/{column}/chain. dbt-only tenants see the backend's message rather than a crash.
  • Asset References panel — sidebar on every dataset / dashboard / recipe edit page. Lists downstream consumers of the asset — answers "what breaks if I change this?" without leaving the editor. Each consumer row carries a last_used_at age badge, and the columns drawn on by SQL recipes / dashboards / data APIs are tracked via a shared SQL column-reference scanner (v0.0.92–v0.0.93).

Under the hood, lineage now resolves through a pluggable catalog adapter layer (the dbt-decoupling work, v0.0.88) rather than reading the dbt manifest directly. A ModelCatalog abstraction backs lineage reads, dataset listing, governance/PII summaries, and the DAG. It is implemented by a DbtManifestCatalog (manifest-backed, byte-identical for dbt tenants) and a NativeCatalog (backed by the honeyframe.datasets registry, information_schema columns, and a first-class honeyframe.dataset_edges table). On native tenants, recipe runs and Flow Builder canvas saves auto-write dependency edges, so native orgs get DAG lineage without a manifest. The adapter set per org is configured via catalog_config; see Connectors → Catalog adapters.

Visual lineage from a recipe's block config (without compiling and running it) is available via POST /api/flow/visual-lineage.

API reference

Endpoint | Description
GET /api/projects/{id}/flow | Current node + edge state for a project.
POST /api/flow/generate-sql | Compile visual block config → dbt SQL.
POST /api/flow/parse-to-visual | Best-effort SQL → block config.
POST /api/flow/preview-sql | Run generated SQL, return sample rows. Does not save.
POST /api/flow/visual-lineage | Column-level lineage from block config.
GET /api/flow/recipe/{model_name} | Fetch the visual recipe for a dbt model.
POST /api/flow/save-recipe-steps | Persist visual block config as a dbt model.
POST /api/pipeline/run | Trigger a build target.
POST /api/flow/build-native | Native dependency-order build (native orgs). Supports ?select=, ?background=true, ?dry_run=true.
GET /api/catalog/engine | Org engine profile (dbt or native).
GET /api/flow/recipes/native-readiness | Ready/blocked verdict for converting an org to native.
POST /api/flow/recipes/convert-to-sql | Bulk-convert recipes to native target='sql'.
GET /api/pipeline/runs | List past runs with status, duration, log refs.
GET /api/flow/zones/shares | Project-wide zone-share map.
POST /api/flow/zones/{id}/shares | Share nodes into a zone (virtual reference).
GET /api/lineage/model/{name} | Model SQL for canvas SQL-peek / dbt recipe nodes.
GET /api/lineage/{dataset} | Full lineage subgraph with column-level edges.
GET /api/flow/zones | List zone templates.
POST /api/flow/zones/auto-assign | Auto-assign nodes to zones.
POST /api/flow/zone-layout | Persist canvas positions.
GET /api/flow/recipe-templates | List reusable recipe templates.
POST /api/flow/python-recipe/run | Execute a Python recipe.
POST /api/flow/ai-recipes/{id}/run | Execute an AI recipe.

Performance characteristics

The canvas renders client-side. Node positions are stored on each row in the DB (not auto-laid-out on each load), which makes initial render O(n). For projects with < 200 nodes the canvas is interactive at 60 fps. Above ~500 nodes layout becomes the bottleneck — collapse zones, or use the Search sidebar to jump directly to a node by name.

Gotchas

  • Python and Notebook recipes are opaque to dbt — the Flow shows them as nodes and tracks their input/output edges, but they don't participate in dbt run's topological build. Schedule them separately.
  • Sync recipes are not dbt models — they are Python steps that move bytes from a connector to a dataset. Schedule them on the Schedules page; subgraph builds do not include them.
  • parse-to-visual is best-effort. Complex SQL falls back to a single sql block. Don't expect round-trip byte-identity.
  • AI recipes require backing rows. Agents must be published from Agent Builder, knowledge bases must be created on the Knowledge tab — adding the node on the Flow canvas references them, it does not create them.