Version: v0.1.7

Catalog & Lineage

The catalog is the metadata layer behind every surface that needs to know what datasets exist, what their columns are, and what depends on what. Dataset listing, column schemas, governance/PII summaries, the Flow DAG, and the cross-tool lineage walk all read through it. It used to read the dbt manifest directly; as of the dbt-decoupling work (v0.0.88) it goes through a pluggable ModelCatalog abstraction instead, so an org can resolve metadata from dbt, from the native registry, or from external tools — without that choice leaking into the consuming code.

This page covers the catalog abstraction, per-org configuration, the metadata adapters that ship, and the unified lineage walk that stitches lineage across those tools.

The catalog abstraction

ModelCatalog is a Protocol with a small surface — list_models(), get_columns(), get_upstreams(), get_model_by_uid(), and friends. Anything that needs metadata calls the catalog; it never touches the manifest or information_schema directly. Two implementations ship:

  • DbtManifestCatalog — backed by the project's dbt manifest. Byte-identical to the pre-decoupling behaviour for dbt tenants: models, columns, layers, tags, and materializations all come from manifest nodes.
  • NativeCatalog — backed by the honeyframe.datasets registry for model existence, information_schema.columns for schemas, and a first-class honeyframe.dataset_edges table for dependency lineage. This is what dbt-less ("native") tenants run on.

Because the consumers (/datasets, /datasets/available, /datasets/import, the governance summary, the data dictionary, the pipeline-funnel and KPI materializations) all read through the Protocol, the same UI works on both engines. A dbt model's materialization comes from the manifest; a native model leaves it None. Column counts come from manifest node.columns on dbt and from information_schema on native.
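
To make the "read through the Protocol" idea concrete, here is a minimal sketch. The method signatures, the `get_materialization()` accessor, the `InMemoryCatalog` class, and the `describe()` consumer are all assumptions for illustration, not the shipped interface.

```python
from typing import Optional, Protocol


class ModelCatalog(Protocol):
    """Assumed shape of the catalog surface; signatures are illustrative."""

    def list_models(self) -> list[str]: ...
    def get_columns(self, model: str) -> list[str]: ...
    def get_upstreams(self, model: str) -> list[str]: ...
    def get_materialization(self, model: str) -> Optional[str]: ...


class InMemoryCatalog:
    """Hypothetical stand-in for DbtManifestCatalog or NativeCatalog."""

    def __init__(self, columns, upstreams, materializations=None):
        self._columns = columns
        self._upstreams = upstreams
        self._materializations = materializations or {}

    def list_models(self):
        return sorted(self._columns)

    def get_columns(self, model):
        return list(self._columns[model])

    def get_upstreams(self, model):
        return list(self._upstreams.get(model, []))

    def get_materialization(self, model):
        # A native-style catalog has nothing to report here, so this is None.
        return self._materializations.get(model)


def describe(catalog: ModelCatalog, model: str) -> dict:
    """A consumer that only sees the Protocol, never the backing store."""
    return {
        "model": model,
        "column_count": len(catalog.get_columns(model)),
        "upstreams": catalog.get_upstreams(model),
        "materialization": catalog.get_materialization(model),
    }
```

Because `describe()` depends only on the Protocol, swapping the backing implementation changes the values it returns but not the calling code.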

Per-org configuration

The catalog is configured per org, so two orgs on one install can resolve metadata from different sources.

Adapter list (catalog_config)

Each org carries a catalog_config JSONB column listing its metadata adapters. The default is dbt-only ([{"adapter":"dbt"}]); a NULL, empty, or non-list value falls back to that default, so the catalog always has something deterministic to iterate.

  • GET /api/catalog/config (project.view) — the org's adapter list.
  • PUT /api/catalog/config (org.admin) — replace the adapter list. Unknown adapter names and duplicate instance_name values are rejected with 400 at the API boundary rather than failing later at instantiation. A successful write logs a catalog.config.updated audit row so compliance can reconstruct who changed which metadata source when.
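
The fallback and validation rules above could be sketched roughly as follows. The function names and the `KNOWN_ADAPTERS` set are hypothetical (the set is assumed to mirror the tool prefixes listed on this page); only the dbt-only default and the two rejection rules come from the text.

```python
DEFAULT_CONFIG = [{"adapter": "dbt"}]

# Assumed registry names; the real set is whatever adapters the install ships.
KNOWN_ADAPTERS = {"dbt", "dataiku", "kafka", "ssis", "informatica", "ogg", "s3"}


def effective_config(raw):
    """Read path: NULL, empty, or non-list values fall back to dbt-only."""
    if not isinstance(raw, list) or not raw:
        return [dict(entry) for entry in DEFAULT_CONFIG]
    return raw


def validate_config(entries):
    """Write path: return a list of problems; a non-empty list maps to HTTP 400."""
    errors = []
    seen_instances = set()
    for entry in entries:
        if not isinstance(entry, dict):
            errors.append(f"entry is not an object: {entry!r}")
            continue
        name = entry.get("adapter")
        if name not in KNOWN_ADAPTERS:
            errors.append(f"unknown adapter: {name!r}")
        instance = entry.get("instance_name")
        if instance is not None:
            if instance in seen_instances:
                errors.append(f"duplicate instance_name: {instance!r}")
            seen_instances.add(instance)
    return errors
```

Keeping the read-path fallback separate from write-path validation means a stored value that predates validation still resolves to something deterministic.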

See Connectors → Catalog adapters for the connector-side view of the same configuration.

Engine profile (dbt | native)

Separately from the adapter list, each org has an engine profile that selects which ModelCatalog implementation backs it and how recipes build:

  • GET /api/catalog/engine / PUT /api/catalog/engine (org.admin) — read or flip the engine profile.

Flipping to native is guarded: the call returns 409 if any SQL-shaped recipe in the org isn't native-runnable (pass ?force to override). Flipping back to dbt is unguarded. See Flow → Build engines for what the native engine does at build time. The engine flip is also surfaced as a toggle in the Migration Cockpit so a tenant can opt into the native catalog without a direct DB change.
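
As a rough illustration of the guard, the sketch below models the decision as a pure function. The recipe dict keys (`name`, `kind`, `native_runnable`), the response bodies, and the 400 for an unknown engine are assumptions; only the 409-unless-forced rule and the unguarded flip back to dbt come from the description above.

```python
def flip_engine(target, recipes, force=False):
    """Return (status_code, body) for a hypothetical engine flip request."""
    if target not in ("dbt", "native"):
        return 400, {"error": f"unknown engine: {target}"}
    if target == "native" and not force:
        # Only SQL-shaped recipes can block the flip.
        blocking = [
            r["name"]
            for r in recipes
            if r["kind"] == "sql" and not r["native_runnable"]
        ]
        if blocking:
            return 409, {"blocking_recipes": blocking}
    # Flipping back to dbt, or forcing, skips the readiness check.
    return 200, {"engine": target}
```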

Native lineage

On native tenants there is no manifest to read dependency edges from, so the platform writes them as data builds happen. Two paths populate honeyframe.dataset_edges:

  • Recipe runs — the recipe runner auto-writes depends_on edges for each run's input/output datasets.
  • Flow Builder canvas saves — saving a recipe (including multi-input / multi-output recipes) writes its lineage edges from the canvas wiring.

The result is that a native org gets a working DAG and upstream/downstream lineage without a dbt manifest. Edges can also be written directly:

  • POST / DELETE /api/datasets/{name}/lineage/upstreams — body {upstream_names: [...]}. Each name is resolved to a dataset within the caller's project and inserted as a depends_on edge with ON CONFLICT DO NOTHING (idempotent). Names that don't resolve come back in skipped[] rather than 4xx-ing — partial writes are the right semantic for callers rebuilding lineage across renames.
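
The idempotent, partial-write behaviour of the POST can be sketched with an in-memory SQLite database. The table layouts, column names, and the `resolved` key in the result are assumptions made so the example runs; the real `honeyframe.dataset_edges` schema is not shown on this page.

```python
import sqlite3


def add_upstreams(conn, project, dataset, upstream_names):
    """Insert depends_on edges; unresolved names are reported, not raised."""
    resolved, skipped = [], []
    for name in upstream_names:
        row = conn.execute(
            "SELECT 1 FROM datasets WHERE project = ? AND name = ?",
            (project, name),
        ).fetchone()
        if row is None:
            skipped.append(name)
            continue
        # Re-inserting an existing edge is a no-op, which keeps the call idempotent.
        conn.execute(
            "INSERT INTO dataset_edges (project, downstream, upstream, kind) "
            "VALUES (?, ?, ?, 'depends_on') ON CONFLICT DO NOTHING",
            (project, dataset, name),
        )
        resolved.append(name)
    return {"resolved": resolved, "skipped": skipped}


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE datasets (project TEXT, name TEXT, PRIMARY KEY (project, name));
    CREATE TABLE dataset_edges (
        project TEXT, downstream TEXT, upstream TEXT, kind TEXT,
        PRIMARY KEY (project, downstream, upstream, kind)
    );
    INSERT INTO datasets VALUES ('p1', 'orders'), ('p1', 'raw_orders');
    """
)
```

Calling `add_upstreams(conn, "p1", "orders", ["raw_orders", "renamed_away"])` twice leaves a single edge row, and the name that does not resolve is returned under `skipped` both times.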

Metadata adapters

Beyond dbt and the native registry, the catalog can pull metadata and lineage from external data tools. Each adapter publishes its assets and edges into the same model so the lineage walk can stitch across them. Adapter assets are addressed by a four-part fully-qualified name (FQN) of the form <tool>.<db>.<schema>.<table>, and the leading tool prefix tells the walk which adapter resolves each hop.

| Adapter | Tool prefix | Mode | Notes |
|---|---|---|---|
| dbt | dbt. | manifest | Models, columns, lineage; column-level via sqlglot. |
| Dataiku DSS | dataiku. | live (REST) | Walks the DSS public REST API for projects, datasets, and flow edges. SQL-recipe column lineage. |
| Kafka | kafka. | live (REST) | Schema Registry walker (subjects → assets); optional Kafka Connect REST walker for connector ↔ topic edges. |
| SSIS | ssis. | live (file) | Parses checked-out .dtsx XML packages. |
| Informatica | informatica. | live (REST) | IICS / IDMC: mappings and mtTasks as assets; per-mapping column lineage. |
| Oracle GoldenGate | ogg. | live (file) | Parses .prm parameter files into extract/replicate edges. |
| S3 | s3. | live (storage) | Bucket walk with Hive-partition collapse. |

Most adapters that started as file-inventory listings grew a live extraction mode in catalog Phase 4 (Dataiku, Kafka, Informatica, SSIS), so the same registry name works in either mode and operators flip an adapter to live by setting its config.
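
The four-part FQN introduced at the top of this section is what the walk pivots on. A minimal parser might look like the sketch below; `Fqn` and `parse_fqn` are hypothetical names, and the sketch assumes no part contains a dot, which quoted identifiers could violate.

```python
from typing import NamedTuple


class Fqn(NamedTuple):
    tool: str
    db: str
    schema: str
    table: str


def parse_fqn(fqn: str) -> Fqn:
    """Split <tool>.<db>.<schema>.<table>; reject anything that is not four non-empty parts."""
    parts = fqn.split(".")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected <tool>.<db>.<schema>.<table>, got {fqn!r}")
    return Fqn(*parts)
```

With this shape, `parse_fqn("dbt.analytics.public.orders").tool` gives the prefix that selects the adapter.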

How each adapter resolves lineage

  • dbt — column-level lineage is extracted from each model's SQL via sqlglot. The parser produces a column_map ({out_col: [(upstream_fqn, upstream_col), ...]}) given a SQL string and an upstream resolver; this is pure SQL parsing with no adapter coupling, so any adapter carrying SQL text reuses it. A column whose origin can't be parsed is left None (null = unknown).
  • Dataiku DSS: get_lineage fetches each SQL recipe's body and runs it through the shared column-lineage extractor, attaching a column map per (input, output) edge, the same contract as dbt.
  • Informatica — per-mapping column lineage: cross-joins sources × targets into edges (IICS mappings can fan out), and resolves a mapping's SQL override through the shared column extractor when one is present.
  • Kafka — the Schema Registry walker emits topic/subject assets but no edges (topic-to-topic lineage lives in Connect/KSQL, not the registry). When a connect_url is configured, the Connect walker synthesises connector ↔ topic edges: topic → connector for sinks and connector → topic for sources, both REPLICATE.
  • Oracle GoldenGate — the .prm walker emits one asset per declared table and turns parameter statements into edges: each TABLE (EXTRACT) and MAP (REPLICAT) statement becomes a source → target REPLICATE edge.
  • S3 — emits one asset per logical table after collapsing Hive partitions, so a partitioned path is one asset rather than thousands. get_lineage returns []: S3 is a storage layer, and the tools that read/write it (a Dataiku S3 recipe, dbt-spark, the OGG bigdata adapter) own the edges that touch s3.* FQNs.
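
Of the behaviours listed above, the Hive-partition collapse is the easiest to show in isolation. The sketch below is one plausible way to do it, assuming object keys use `key=value` path segments for partitions; the function name and the cut-at-first-partition rule are assumptions, not the adapter's actual code.

```python
def collapse_hive_partitions(keys):
    """Map object keys to logical table paths by cutting at the first key=value segment."""
    tables = set()
    for key in keys:
        segments = key.strip("/").split("/")
        prefix = []
        # The last segment is the object (file) name, so it is never part of the table path.
        for segment in segments[:-1]:
            if "=" in segment:
                break
            prefix.append(segment)
        if prefix:
            tables.add("/".join(prefix))
    return sorted(tables)
```

Thousands of keys under `sales/orders/dt=.../` therefore collapse to the single logical path `sales/orders`.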

Cross-tool FQN stitching

The lineage walk stitches one tool's output into another's input when both sides emit the same FQN. The key mechanism is vendor-prefix rewriting: GoldenGate's OGG adapter accepts source_vendor / target_vendor config and emits FQNs like oracle.<db>.<schema>.<table> instead of its OGG-local namespace. An Oracle catalog adapter pointed at the same database publishes identical FQNs, so the walk stitches the OGG → Oracle hop with no glue code. The same trick works for heterogeneous deployments (e.g. Oracle → Kafka). When the vendor config is absent, the adapter falls back to its tool-local namespace silently.
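
A small sketch of the vendor-prefix rewrite, assuming the adapter config is a plain dict and that the tool-local prefix is `ogg` (taken from the adapter table). The `ogg_edge` helper and its tuple arguments are hypothetical.

```python
def ogg_edge(config, source, target):
    """Build a REPLICATE edge; source and target are (db, schema, table) tuples."""
    # Fall back to the tool-local namespace when no vendor is configured.
    source_prefix = config.get("source_vendor") or "ogg"
    target_prefix = config.get("target_vendor") or "ogg"
    return (
        ".".join((source_prefix, *source)),
        ".".join((target_prefix, *target)),
        "REPLICATE",
    )
```

With `{"source_vendor": "oracle", "target_vendor": "oracle"}` the target comes out as `oracle.DWH.HR.EMP`, which is the same string an Oracle adapter pointed at that database would publish, so the hop stitches. With an empty config the target is `ogg.DWH.HR.EMP` and nothing matches.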

The lineage walk

GET /api/catalog/lineage/walk?root_fqn=… returns a combined lineage graph stitched across every configured adapter instance. It pivots on the FQN's leading tool prefix to decide which adapter resolves each hop, then recurses through cross-tool edges, so a single graph can span a Dataiku recipe, an SSIS package, a Kafka topic, and a dbt model.

Query options:

  • ?tools= — scope the walk to named adapters, e.g. ?tools=dbt,ssis. Edges that reach a filtered-out adapter are still drawn at the perimeter, but recursion stops there. The root FQN's own tool is always allowed regardless of the filter — handy for a clean two-system slice in a demo.
  • direction / depth — the walk supports upstream/downstream direction and a depth bound; the lineage graph UI exposes a 1 / 2 / 3 / … / All depth picker.
  • column scoping — when an adapter attaches a column map to an edge, the walk carries it through, so column-level lineage (e.g. customer_name ← name across a Dataiku → dbt edge) renders inline.

Per-adapter get_lineage failures are logged and skipped, so one broken adapter (a half-broken DSS, an unreachable registry) degrades that adapter's health rather than blanking the whole graph. The layout is deterministic: the same FQNs in produce the same picture out.
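
The walk's main rules (prefix dispatch, the tools filter with perimeter edges, the depth bound, and skipping failed adapters) can be sketched as a breadth-first traversal. This is a simplified model, not the service's implementation: each adapter is assumed to be a callable that takes an FQN and returns `(source_fqn, target_fqn)` pairs, and the return shape is invented for the example.

```python
from collections import deque


def walk(root_fqn, adapters, tools=None, direction="downstream", depth=None):
    """Walk lineage from root_fqn, dispatching each hop on the FQN's tool prefix."""
    root_tool = root_fqn.split(".", 1)[0]
    # The root's own tool is always allowed, whatever the filter says.
    allowed = None if tools is None else set(tools) | {root_tool}
    edges, failed = set(), set()
    seen = {root_fqn}
    queue = deque([(root_fqn, 0)])
    while queue:
        fqn, level = queue.popleft()
        tool = fqn.split(".", 1)[0]
        if allowed is not None and tool not in allowed:
            continue  # perimeter node: already drawn, but not expanded
        if depth is not None and level >= depth:
            continue
        adapter = adapters.get(tool)
        if adapter is None:
            continue
        try:
            hops = adapter(fqn)
        except Exception:
            failed.add(tool)  # one broken adapter must not blank the graph
            continue
        for source, target in hops:
            here, nxt = (source, target) if direction == "downstream" else (target, source)
            if here != fqn:
                continue
            edges.add((source, target))
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return {"nodes": sorted(seen), "edges": sorted(edges), "failed_adapters": sorted(failed)}
```

Sorting the nodes and edges before returning is one simple way to get the "same FQNs in, same picture out" property for whatever layout step follows.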

In the UI

The combined lineage walk surfaces in the Migration Cockpit: clicking Lineage on a feed row opens a side drawer that calls /api/catalog/lineage/walk and renders the stitched graph. The drawer has a Graph / List toggle — the SVG graph view colours each node by tool prefix (dbt blue, Dataiku amber, Kafka pink, SSIS indigo, Informatica green, OGG purple), and clicking a node isolates its lineage path. Column maps expand inline under each edge. The list view keeps the raw edge tables for anyone who prefers them.

How it connects to other surfaces

  • Flow — the canvas DAG, the Lineage Explorer, and the column-level chain drill-down all read through the catalog. On native tenants this is what makes the DAG work without a manifest.
  • Datasets — dataset existence, columns, layers, and the upstream/downstream counts on the detail page come from the catalog.
  • SDK: Dataset.upstream() / Dataset.downstream() walk recursive ancestors/dependents, and Project.dag() returns {nodes, edges}. These read through the model catalog, so they work identically on dbt and native tenants; a script can answer "what feeds this / what breaks if I change it" without the UI.
  • Cobuild — the AI assistant's get_lineage tool walks the cross-tool lineage graph for an asset and renders it inline as a lineage card.
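
To illustrate what a recursive ancestor/dependent walk over a `{nodes, edges}` structure involves, here is a standalone sketch. It does not call the SDK: the `reachable` function is hypothetical, and edges are assumed to be `(upstream, downstream)` name pairs, which may differ from what Project.dag() actually returns.

```python
def reachable(dag, name, direction="upstream"):
    """Return every transitive ancestor (or dependent) of `name` in a {nodes, edges} dict."""
    neighbours = {}
    for upstream, downstream in dag["edges"]:
        if direction == "upstream":
            neighbours.setdefault(downstream, set()).add(upstream)
        else:
            neighbours.setdefault(upstream, set()).add(downstream)
    found, stack = set(), [name]
    while stack:
        for nxt in neighbours.get(stack.pop(), ()):
            if nxt not in found:
                found.add(nxt)
                stack.append(nxt)
    return found
```

The upstream direction answers "what feeds this", and the downstream direction answers "what breaks if I change it".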

API reference

| Endpoint | Method | Purpose |
|---|---|---|
| /api/catalog/config | GET | The org's metadata adapter list. |
| /api/catalog/config | PUT | Replace the adapter list (org.admin). Validates unknown names and duplicate instance_name. |
| /api/catalog/engine | GET | The org's engine profile (dbt \| native). |
| /api/catalog/engine | PUT | Flip the engine profile (org.admin). →native 409s on un-ready recipes unless ?force. |
| /api/catalog/lineage/walk | GET | Cross-adapter lineage graph from ?root_fqn=. Supports ?tools=, direction, and depth. |
| /api/catalog/assets | GET | Cached catalog assets across configured adapters. |
| /api/datasets/{name}/lineage/upstreams | POST / DELETE | Write/remove native dependency edges ({upstream_names: [...]}, idempotent). |
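
As a usage sketch for the lineage walk endpoint, the helper below only builds the request URL. The `root_fqn` and `tools` parameter names come from this page; treating `direction` and `depth` as query parameters with those exact names, and the host and example values, are assumptions.

```python
from urllib.parse import urlencode


def lineage_walk_url(base_url, root_fqn, tools=None, direction=None, depth=None):
    """Build a GET URL for /api/catalog/lineage/walk from optional walk settings."""
    params = {"root_fqn": root_fqn}
    if tools:
        params["tools"] = ",".join(tools)
    if direction:
        params["direction"] = direction  # assumed parameter name
    if depth is not None:
        params["depth"] = depth  # assumed parameter name
    # Keep commas literal so the tools list reads as ?tools=dbt,ssis.
    query = urlencode(params, safe=",")
    return f"{base_url.rstrip('/')}/api/catalog/lineage/walk?{query}"
```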

Gotchas

  • Adapter config validation is at the boundary. An unknown adapter name or a duplicate instance_name in catalog_config is a 400 on PUT, not a runtime failure — fix the config, not the catalog.
  • A tools= filter still shows perimeter edges. Filtered-out adapters appear as graph boundaries; recursion just stops there. The root FQN's tool is never filtered out.
  • S3 emits no edges of its own. S3 assets only appear in lineage because an upstream/downstream tool (Dataiku, dbt-spark, OGG) emits the edges touching s3.* FQNs. With no such tool configured, an S3 asset is an island.
  • Cross-tool stitching needs matching FQNs. The OGG → Oracle (or Oracle → Kafka) hop only stitches when both adapters emit the same vendor-prefixed FQN. Without source_vendor / target_vendor set, OGG stays in its tool-local namespace and the hop won't stitch.
  • Column lineage is best-effort. sqlglot parses what it can; an edge whose column origin can't be resolved carries None, not a guess. Kafka Schema Registry and the SSIS/Informatica file modes may emit assets without column-level edges.