Data Policies
Honeyframe applies column-level masking to query results based on the column's classification, the user's role, and any masking policies attached to the dataset or organization. Masking is enforced post-query in Python so it works uniformly across every connector (PostgreSQL, Oracle, MySQL, MSSQL, etc.) without connector-specific SQL.
The masking engine lives at paas/backend/services/masking.py.
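Post-query masking can be pictured as a pure function over result rows, which is why it does not depend on the connector. The following is a minimal sketch under that assumption; `mask_rows` and the `maskers` mapping are hypothetical names for illustration, not the engine's actual API.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Masker = Callable[[object], object]


def mask_rows(rows: List[Row], maskers: Dict[str, Masker]) -> List[Row]:
    """Apply a per-column masker to every row; unlisted columns pass through."""
    return [
        {col: (maskers[col](val) if col in maskers else val) for col, val in row.items()}
        for row in rows
    ]


# The rows could come from any connector; only the column names matter here.
rows = [{"id": 1, "email": "john.doe@gmail.com"}]
masked = mask_rows(rows, {"email": lambda _value: "***"})
```

The input rows are left untouched, so the unmasked result never has to be mutated in place.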
Concepts
A column has a semantic type, a sensitivity level, and a masking strategy. The platform classifies columns automatically based on column names matching known PII patterns; classifications can be overridden per dataset or per organization.
Sensitivity levels
| Level | Default treatment |
|---|---|
| critical | Always masked unless the user holds an unmask role. |
| high | Masked for non-admin viewers by default. |
| medium | Masked for viewer-tier roles. |
| low | Not masked. |
Masking strategies
| Strategy | Behavior |
|---|---|
| partial | Show first/last few characters, mask the middle (john****@gmail.com). |
| full | Replace the value with a fixed mask (***). |
| hash | Replace with a deterministic hash so joins still work (sha256(value)[:12]). |
| redact | Drop the value entirely (returns null). |
| none | Pass through unchanged. |
The supported strategies are enumerated in STRATEGIES = ("partial", "full", "hash", "redact", "none").
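The five strategies could be sketched roughly as follows. This is an illustration, not the code in masking.py: `mask_partial`, `apply_strategy`, and the number of characters kept are assumptions; only the strategy names and the `sha256(value)[:12]` form come from the table above.

```python
import hashlib

STRATEGIES = ("partial", "full", "hash", "redact", "none")


def mask_partial(value: str, keep: int = 4) -> str:
    """Keep a few leading/trailing characters and mask the middle (assumed keep=4)."""
    local, at, domain = value.partition("@")
    if at:  # email-like value: mask the local part, keep the domain
        return local[:keep] + "****" + at + domain
    if len(value) <= keep * 2:  # too short to reveal anything safely
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - keep * 2) + value[-keep:]


def apply_strategy(value, strategy: str):
    """Return the masked form of `value` under one of STRATEGIES."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown masking strategy: {strategy!r}")
    if value is None or strategy == "none":
        return value
    text = str(value)
    if strategy == "partial":
        return mask_partial(text)
    if strategy == "full":
        return "***"
    if strategy == "hash":
        # Deterministic, so equal inputs still join on the masked value.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return None  # "redact"
```

For example, `apply_strategy("john.doe@gmail.com", "partial")` yields `john****@gmail.com` in this sketch.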
Default classifications
The PII_DEFAULTS table seeds the engine with classifications for common Indonesian healthcare and personal-data fields:
| Semantic type | Default sensitivity | Default strategy | Unmask roles |
|---|---|---|---|
| nik | critical | partial | admin |
| email | high | partial | admin |
| phone | high | partial | admin |
| address | high | partial | admin |
| name | medium | partial | admin |
Auto-classification kicks in when a query result column name matches a known semantic-type pattern (e.g. email, customer_email, email_address all map to email). Other columns default to no masking.
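Name-based classification might look like the sketch below. The patterns are hypothetical (the engine's real pattern list is not documented here); they are chosen only so that email, customer_email, and email_address all resolve to email.

```python
import re
from typing import Optional

# Hypothetical patterns, checked in order; the first match wins.
SEMANTIC_PATTERNS = {
    "nik": re.compile(r"(^|_)nik(_|$)"),
    "email": re.compile(r"(^|_)email(_|$)"),
    "phone": re.compile(r"(^|_)phone(_|$)"),
    "address": re.compile(r"(^|_)address(_|$)"),
    "name": re.compile(r"(^|_)name(_|$)"),
}


def classify_column(col_name: str) -> Optional[str]:
    """Map a result column name to a semantic type, or None when nothing matches."""
    normalized = col_name.strip().lower()
    for semantic_type, pattern in SEMANTIC_PATTERNS.items():
        if pattern.search(normalized):
            return semantic_type
    return None  # unclassified columns default to no masking
```

Matching on underscore-delimited tokens keeps a column such as username from being classified as name in this sketch.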
Resolution order
When the engine masks a query result, it resolves the rule for each column in this order — first match wins:
- Dataset-level override: the datasets.settings.masking[col_name] JSON object on the dataset record.
- Org-level default: the organizations.data_policies.masking_defaults[semantic_type] JSON object on the org record.
- Auto-classify default: the entry in PII_DEFAULTS[semantic_type].
- No rule → no masking.
This means a dataset owner can promote a normally-masked column to none for a specific dataset (e.g. an analyst-facing aggregate view), and an organization admin can tighten or loosen the default for everyone.
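The first-match-wins lookup described above can be sketched as a short chain of dictionary lookups. `resolve_rule` and its argument names are assumptions; `dataset_settings` stands for datasets.settings and `org_policies` for organizations.data_policies, and the `PII_DEFAULTS` literal only mirrors the table in this page.

```python
from typing import Optional

PII_DEFAULTS = {
    "nik": {"sensitivity": "critical", "strategy": "partial", "unmask_roles": ["admin"]},
    "email": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "phone": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "address": {"sensitivity": "high", "strategy": "partial", "unmask_roles": ["admin"]},
    "name": {"sensitivity": "medium", "strategy": "partial", "unmask_roles": ["admin"]},
}


def resolve_rule(
    col_name: str,
    semantic_type: Optional[str],
    dataset_settings: dict,
    org_policies: dict,
) -> Optional[dict]:
    """Return the winning rule plus the reason it won, or None for no masking."""
    dataset_rule = dataset_settings.get("masking", {}).get(col_name)
    if dataset_rule is not None:
        return {**dataset_rule, "reason": "dataset-override"}
    if semantic_type is not None:
        org_rule = org_policies.get("masking_defaults", {}).get(semantic_type)
        if org_rule is not None:
            return {**org_rule, "reason": "org-default"}
        default = PII_DEFAULTS.get(semantic_type)
        if default is not None:
            return {**default, "reason": "auto-classify"}
    return None  # no rule, so the column passes through unmasked
```

Note that the dataset-level lookup is keyed by column name while the other two are keyed by semantic type, which is what lets one dataset override a single column without touching the org default.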
Setting a dataset-level rule
Update the dataset's settings.masking field via the dataset settings UI or the /api/datasets/{dataset_id} endpoint:
    {
      "settings": {
        "masking": {
          "customer_email": {
            "strategy": "hash",
            "unmask_roles": ["admin", "cs_staff"]
          },
          "customer_phone": {
            "strategy": "redact"
          }
        }
      }
    }
unmask_roles is a list of role strings that bypass the mask for this column. If omitted, the engine uses the default unmask roles from PII_DEFAULTS.
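The fallback for an omitted unmask_roles could be sketched as below. The helper names and the `DEFAULT_UNMASK_ROLES` mapping are hypothetical; the mapping only restates the Unmask roles column of the PII_DEFAULTS table. Treating a column with no semantic type as having no unmask roles is also an assumption.

```python
from typing import Iterable, List, Optional

# Restates the "Unmask roles" column of PII_DEFAULTS for this sketch.
DEFAULT_UNMASK_ROLES = {
    "nik": ["admin"],
    "email": ["admin"],
    "phone": ["admin"],
    "address": ["admin"],
    "name": ["admin"],
}


def effective_unmask_roles(rule: dict, semantic_type: Optional[str]) -> List[str]:
    """Use the rule's own unmask_roles when present, else the PII default."""
    if "unmask_roles" in rule:
        return list(rule["unmask_roles"])
    return list(DEFAULT_UNMASK_ROLES.get(semantic_type, []))


def bypasses_mask(user_roles: Iterable[str], rule: dict, semantic_type: Optional[str]) -> bool:
    """True when any of the user's roles is allowed to see the raw value."""
    return bool(set(user_roles) & set(effective_unmask_roles(rule, semantic_type)))
```

With the payload above, a cs_staff user would bypass the mask on customer_email but not on customer_phone, which falls back to admin only.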
Setting an org-level default
Update the organization's data_policies.masking_defaults:
    {
      "data_policies": {
        "masking_defaults": {
          "email": {"strategy": "hash"},
          "phone": {"strategy": "full"}
        }
      }
    }
Org-level defaults override PII_DEFAULTS for every dataset in the org that does not have its own dataset-level rule.
Per-project unmask roles
Some installs need finer-grained control, e.g. a customer-service team that should see unmasked phone numbers only on the projects they're assigned to. The engine honors an unmask_project_roles field on the rule. If the user is a member of a project where their project role is in unmask_project_roles, the column is unmasked only for queries scoped to that project.
    {
      "phone": {
        "strategy": "partial",
        "unmask_roles": ["admin"],
        "unmask_project_roles": ["admin", "cs_staff"]
      }
    }
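One way to read the combined check is sketched below. `should_unmask` and its `project_role` argument are hypothetical: `project_role` is assumed to be the user's role on the project the query is scoped to, or None when the query is not project-scoped or the user is not a member.

```python
from typing import Iterable, Optional


def should_unmask(
    rule: dict,
    user_roles: Iterable[str],
    project_role: Optional[str] = None,
) -> bool:
    """Unmask on a global unmask role, or on a matching role in the scoped project."""
    if set(user_roles) & set(rule.get("unmask_roles", [])):
        return True
    if project_role is not None and project_role in rule.get("unmask_project_roles", []):
        return True
    return False


phone_rule = {
    "strategy": "partial",
    "unmask_roles": ["admin"],
    "unmask_project_roles": ["admin", "cs_staff"],
}
```

Under this reading, a viewer who is cs_staff on the scoped project sees raw phone numbers there, and masked ones everywhere else.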
Where masking does (and doesn't) apply
Masking is enforced by the platform's SQL execution path (/api/chat, /api/datasets/{id}/explore, dashboard queries, and dataset preview). It is not enforced for:
- Direct database access: anyone with a database connection string sees the unmasked rows. Treat the masking engine as an application-layer protection, not a data-layer one.
- Raw connector exports: the data_api publishing surface does not run results through the masking engine. Sharing a dataset via data_api exposes the raw values.
- Lakehouse Parquet files: files written by ingestion do not carry masking metadata. Anyone who can read the Parquet path sees the raw values.
- dbt model output: dbt runs against the source connector directly; transformation output is unmasked.
For data-layer enforcement, use database-side row security or a separate read replica with masked columns materialized at ingestion time.
Row-level filters
Row-level filtering is not yet implemented as a first-class platform feature. The standard approach is to:
- Define a dataset that includes only the rows a given audience should see (e.g. WHERE org_id = :user_org).
- Share that dataset with the audience instead of the underlying table.
- Use the dataset-level masking rules for column-level concerns.
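The first step of that approach, a dataset whose SQL is pre-filtered with a bound :user_org parameter, can be illustrated with in-memory SQLite as a stand-in for the real connector. The orders table, its columns, and `run_dataset` are invented for the example.

```python
import sqlite3

# In-memory SQLite stands in for the source database in this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER, org_id TEXT, total REAL);
    INSERT INTO orders VALUES (1, 'org_a', 10.0), (2, 'org_b', 20.0), (3, 'org_a', 30.0);
    """
)

# The dataset definition: only rows for the caller's org are ever selected.
DATASET_SQL = "SELECT id, org_id, total FROM orders WHERE org_id = :user_org ORDER BY id"


def run_dataset(connection: sqlite3.Connection, user_org: str) -> list:
    """Execute the pre-filtered dataset query with the org bound as a parameter."""
    return connection.execute(DATASET_SQL, {"user_org": user_org}).fetchall()
```

Binding the org as a parameter rather than formatting it into the SQL string keeps the filter from being altered by the value itself.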
A planned row_filters field on the dataset record will allow declarative row predicates ({"region_id": "{user.region_id}"}) — track the roadmap for the rollout window.
Cross-tenant isolation
v0.0.38 introduced opt-in Postgres RLS as a defense-in-depth layer on top of the application-level org checks. It addressed a class of cross-tenant leak the masking engine couldn't catch — direct SQL paths (/api/chat, dashboard execute-all, lineage queries) that returned rows from other orgs when an org_id filter was missed in the application code. The v0.0.38 release closed 17 such leaks in data_connectors plus 2 in pipelines.
What landed
- An org_isolation migration adds org_id to every multi-tenant table that didn't already have it, plus an RLS policy that enforces org_id = current_setting('app.current_org_id')::uuid on each SELECT/INSERT/UPDATE/DELETE.
- The application sets app.current_org_id per request via a SET LOCAL at the start of the transaction; the GUC is cleared at request end.
- An org-access helper centralizes the lookup so handlers don't have to repeat the SET LOCAL boilerplate.
- A test fixture mints JWTs with arbitrary org_id values to exercise the RLS path without a real login flow.
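A helper that sets the GUC for the current transaction might look like the sketch below. The body is an assumption, not the platform's helper: it takes any DB-API cursor that uses %s placeholders (as psycopg does) and relies on PostgreSQL's `set_config(name, value, is_local)` function, which with `is_local = true` behaves like SET LOCAL but accepts a bound parameter.

```python
import uuid


def set_org_context(cursor, org_id: str) -> str:
    """Scope the current transaction to one org by setting app.current_org_id.

    `cursor` is any DB-API cursor with %s-style parameters. Must be called
    inside an open transaction, since the setting is transaction-local.
    """
    # Validate and normalise first so only a well-formed UUID reaches the session.
    normalized = str(uuid.UUID(str(org_id)))
    cursor.execute(
        "SELECT set_config('app.current_org_id', %s, true)",
        (normalized,),
    )
    return normalized
```

Because the setting is transaction-local, it disappears on commit or rollback, which matches the per-request lifetime described above.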
Opt-in by default
RLS is scaffolding only in v0.0.38: the policies are installed and the helper is available, but no org has the policies enabled by default. Enabling RLS for a tenant takes two steps:
- Set entitlements.rls.opt_in = true on the org's license tier (see Per-org entitlements).
- Call the enable_rls_for_org(<org_id>) admin endpoint, which ALTERs each multi-tenant table to enable the policy for that specific org via a per-org policy clause.
Once flipped, every query the org runs is gated by the GUC. A handler that forgets to call set_org_context() returns zero rows instead of leaking another org's data — the RLS path errs toward empty, not toward open.
When to enable RLS
| Tenant shape | Recommendation |
|---|---|
| Single-tenant install (one org, all users) | Don't enable. Default-allow project visibility + masking is sufficient; RLS adds query overhead with no isolation benefit. |
| Single-tenant with strict departmental separation | Enable per-project access controls; RLS is unnecessary. |
| Multi-tenant SaaS (multiple orgs share a DB) | Enable. The 17+2 leak class is real, and RLS catches the next leak even if the application code regresses. |
| Compliance-driven (HIPAA, GDPR data residency) | Enable. RLS is independently auditable in a way "we have application checks" isn't. |
Operational notes
- Migration is online. Adding org_id columns and RLS policies does not lock tables; the platform runs the migration with CONCURRENTLY where supported.
- Performance impact is ~1-3% on most queries (Postgres applies the GUC predicate as part of the WHERE clause and uses the existing org_id index). Joins across multi-tenant tables benefit from the same indexes the application already needs.
- Backups and restores carry RLS policies along with the schema. Restoring to a fresh DB does not require re-running the migration.
- dbt builds run as the platform's service user, which can be configured to bypass the RLS policy. dbt models are computed across all orgs the service user is allowed to see, then per-org masking is applied at query time.
Auditing masking decisions
The masking engine emits structured logs at INFO level for each query — column names, applied strategy, and reason (auto-classify / dataset-override / org-default). The logs are not stored in the audit table by default. To capture them, configure the application's logger to ship to your SIEM, or add log_audit(...) calls in the masking engine on the strategy-decision path.
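Capturing those log lines only needs a standard logging handler. The sketch below is illustrative: the logger name "masking", the JSON field names, and `log_masking_decisions` are assumptions, and `ListHandler` stands in for whatever handler ships records to a SIEM.

```python
import json
import logging

logger = logging.getLogger("masking")  # hypothetical logger name
logger.setLevel(logging.INFO)


class ListHandler(logging.Handler):
    """Stand-in for a SIEM shipper: keeps each log message in memory."""

    def __init__(self) -> None:
        super().__init__(level=logging.INFO)
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def log_masking_decisions(query_id: str, decisions: list) -> None:
    """Emit one structured INFO line per query (field names are illustrative)."""
    logger.info(json.dumps({"event": "masking", "query_id": query_id, "columns": decisions}))


handler = ListHandler()
logger.addHandler(handler)
log_masking_decisions(
    "q-1",
    [{"column": "customer_email", "strategy": "hash", "reason": "dataset-override"}],
)
```

Swapping `ListHandler` for a network or file handler changes where the records go without touching the code that emits them.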