Skip to content

Snowflake: Mattermost Data Warehouse (dbt)

Metric Count
Total lineage found by LLM 46
Total lineage found by DFA 0
DFA matches (correct) 0
DFA crashes (lineage missed) 0
DFA silent misses 46
DFA robustness issues 1
Accuracy 0%

Source Repository: mattermost-data-warehouse MantisBT: #4387 Date: 2026-04-03


Performance Comparison

Metric LLM DFA
Wall-clock time 55s 1s
Token usage N/A N/A
Agent count 2 parallel agents 1 process
Approach Read & understand SQL + Jinja Parse AST & trace dataflow

Notes: - DFA completed in 1 second because it immediately failed on all Jinja-templated SQL files. - LLM analyzed 47 dbt model files and correctly identified 46 lineage relationships. - This is a dbt project — SQL files contain Jinja templates ({{ ref() }}, {{ source() }}, {% if is_incremental() %}).


Root Cause: dbt/Jinja Incompatibility

DFA found zero lineage because dbt SQL files are incompatible with DFA in two ways:

1. Raw dbt SQL contains Jinja templates (not valid SQL)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
-- Example: models/marts/product/fct_active_servers.sql
SELECT
    server.*,
    {{ dbt_utils.star(from=ref('int_server_license_daily'), ... ) }}
FROM {{ ref('int_server_active_days_spined') }} AS server
LEFT JOIN {{ ref('int_server_license_daily') }} AS license
    ON server.server_id = license.server_id
{% if is_incremental() %}
WHERE server.activity_date > (SELECT MAX(activity_date) FROM {{ this }})
{% endif %}

DFA's Snowflake parser cannot tokenize {{, }}, {%, %}, or Jinja expressions. Every file fails at the tokenizer level with no output.

2. Compiled dbt SQL lacks DDL context

dbt's compiled output (in target/compiled/) renders Jinja to pure SQL, but the result is a bare SELECT statement with no CREATE VIEW or INSERT INTO wrapper:

1
2
3
4
-- target/compiled/.../server_daily_details.sql
SELECT server_id
FROM "ANALYTICS".mattermost.server_daily_details
WHERE server_id IS NULL

DFA found 342 table references across compiled SQL but traced zero lineage relationships because there is no target object (no CREATE VIEW, INSERT INTO, or CTAS wrapping the SELECT).


All 46 LLM-Detected Lineage Relationships (DFA: 0 of 46)

Product Marts (28 models)

# Target Model Source Models Type
1 dim_board_customers _int_server_installation_id_bridge, stg_stripe__subscriptions, stg_stripe__customers, int_boards_active_days_spined view
2 fct_board_activity int_boards_active_days_spined table
3 fct_calls_activity int_calls_active_days_spined, int_server_hosting_type table
4 grp_copilot_tracks stg_copilot__tracks table
5 dim_daily_license int_server_license_daily, int_known_licenses view
6 dim_daily_server_config int__config_oauth, int__config_ldap, int__config_saml, int__config_plugin, int__config_service view
7 dim_daily_server_info int_server_active_days_spined view
8 dim_excludable_servers int_excludable_servers view
9 dim_installation_summary int_server_active_days_spined view
10 dim_server_info int_user_active_days_spined, int_server_telemetry_summary, int_server_license_daily, int_known_licenses, int_server_hosting_type, int_server_ip_to_country, stg_stripe__subscriptions, stg_stripe__customers view
11 dim_version int_server_active_days_spined view
12 dim_ip_daily_summary stg_releases__log_entries incremental
13 fct_downloads stg_releases__log_entries incremental
14 fct_active_servers int_server_active_days_spined, int_server_license_daily view
15 fct_active_users int_user_active_days_spined, int_server_active_days_spined view
16 fct_feature_daily_snapshot int_feature_daily_spine, telemetry_days, int_server_active_days_spined view
17 bdg_license_server int_server_license_daily view
18 dim_latest_server_customer_info int_latest_server_customer_info view
19 dim_license_customers int_active_licenses view
20 dim_license_type stg_stripe__products view
21 fct_active_production_licenses int_active_licenses view
22 fct_licenses int_active_licenses view
23 dim_known_licenses int_known_licenses view
24 fct_notification_stats int_notifications_eu_hourly, int_notifications_us_hourly, int_notifications_test_hourly, int_notifications_logs_eu_hourly, int_notifications_logs_us_hourly, int_notifications_logs_test_hourly view
25 fct_nps_feedback int_nps_feedback, int_nps_score view
26 fct_nps_score int_user_nps_score_spined, int_nps_server_version_spined view
27 rpt_tedau_at_day_28 fct_active_users, dim_daily_server_info, dim_server_info view
28 fct_trial_request_history int_cloud_trial_requests_history, int_onprem_trial_requests_history, stg_salesforce__lead view

Sales Marts (7 models)

# Target Model Source Models Type
29 fct_subscription_history stg_cws__subscription_history, stg_stripe__subscriptions, stg_stripe__products view
30 dim_self_serve_customers stg_stripe__customers view
31 fct_in_product_trial_requests stg_cws__trial_requests, country_codes, stg_salesforce__lead, stg_salesforce__campaign_member, int_onprem_trial_license_information view
32 fct_cloud_trial_requests int_cloud_trial_requests, stg_salesforce__lead, stg_salesforce__campaign_member view
33 sync_opportunity_update_ending_arr int_opportunity_ext, stg_salesforce__opportunity, int_opportunity_daily_arr view
34 sync_account_arr_and_type stg_salesforce__account, int_account_arr_seats view
35 sync_lead_account_link stg_salesforce__lead, stg_salesforce__account view

Release Marts (5 models)

# Target Model Source Models Type
36 fct_issues_daily_snapshot stg_mattermost_jira__issues, stg_mattermost__version_release_dates view
37 dim_labels stg_mattermost_jira__issues view
38 dim_releases stg_mattermost__version_release_dates view
39 dim_fix_versions stg_mattermost_jira__issues, stg_mattermost__version_release_dates view
40 dim_projects stg_mattermost_jira__projects view

Common Marts (2 models)

# Target Model Source Models Type
41 dim_user_agent int_user_agents incremental
42 dim_date telemetry_days view

Other Marts (4 models)

# Target Model Source Models Type
43 dim_events int_events_aggregated_to_date view
44 fct_events_daily_snapshot int_events_aggregated_to_date view
45 fct_security_newsletter stg_cws__marketing, stg_salesforce__lead, stg_salesforce__campaign_member view
46 fct_onprem_trial_requests_history stg_cws__trial_requests, stg_salesforce__lead view

Robustness Issues

Issue 1: JSON SQL Environment parse failures

DFA's TJSONSQLEnvParser throws RuntimeException: Unknown JSON value type when processing dbt project directories. The dbt project contains .json and .yml files (e.g., manifest.json, schema.yml) that the SQL environment parser attempts to read but cannot handle. This generates warning-level stack traces but does not crash the process.

Issue 2: Compiled SQL syntax errors (8 files)

When run against compiled dbt SQL (in target/compiled/), 8 of 251 files had syntax errors. These are primarily in files using Snowflake-specific syntax patterns that the tokenizer doesn't handle (e.g., calls_events.sql with nested WITH clauses).


Side-by-Side: Where Each Approach Excels

Summary

Dimension Winner Evidence Impact
Column-level precision N/A DFA found 0 relationships; cannot compare High
Multi-source expressions N/A No DFA output to evaluate N/A
FK/ER relationships N/A No DDL with FK constraints in dbt models N/A
fdd vs fdr distinction N/A DFA found 0 relationships N/A
Novel/vendor-specific syntax LLM Jinja templates, dbt macros, incremental strategies Critical
Semantic understanding LLM Understood ref()/source() as model dependencies Critical
Graceful degradation LLM DFA produced 0 output; LLM found all 46 relationships Critical
Aggregate/string functions N/A No DFA output to evaluate N/A
Trigger/procedure bodies N/A No triggers/procedures in dbt models N/A

Detail

DFA Advantages in This Codebase

None. DFA cannot process dbt SQL in any form: - Raw SQL with Jinja: tokenizer fails on {{, {% - Compiled SQL without Jinja: no DDL context (bare SELECT, no CREATE VIEW)

If DFA were provided with properly wrapped SQL (e.g., CREATE VIEW model_name AS <compiled_select>), it would likely produce column-level lineage for many of these models. But that pre-processing step is not part of the current DFA workflow.

LLM Advantages in This Codebase

  • Jinja comprehension: LLM read through {{ ref('model_name') }} and {{ source('schema', 'table') }} to understand model dependencies without any template pre-processing
  • dbt conventions: LLM understood that each .sql file IS the model definition, the filename is the target name, and ref() calls are upstream dependencies
  • Incremental logic: LLM identified {% if is_incremental() %} blocks and understood the delete+insert strategy
  • Macro expansion: LLM understood {{ dbt_utils.star() }}, {{ dbt_utils.pivot() }}, {{ dbt_utils.generate_surrogate_key() }} as column-generating macros
  • Architecture: LLM identified the staging/intermediate/marts layered pattern, star schema design, and multi-source integration

Tie / Both Correct

  • Both approaches correctly identified that dbt project directories contain non-SQL files — DFA warned about JSON parse failures, LLM skipped non-SQL files
  • Neither approach can trace lineage through dynamic SQL or runtime-determined table names

Recommendations

  1. dbt pre-processor for DFA: Implement a dbt-aware pre-processing step that:
  2. Runs dbt compile to generate rendered SQL (or reads from target/compiled/)
  3. Wraps each compiled model in CREATE VIEW {model_name} AS ... using metadata from manifest.json
  4. Feeds the wrapped SQL to DFA for column-level lineage analysis

  5. Jinja-aware tokenizer: As a lighter alternative, add a pre-pass to the Snowflake lexer that strips or replaces Jinja expressions with SQL-safe placeholders (e.g., {{ ref('foo') }} -> foo).

  6. manifest.json parser: dbt's manifest.json contains the complete dependency graph. DFA could parse this file to obtain model-level lineage without touching SQL at all, then use compiled SQL for column-level detail.

  7. No parser fixes needed for core SQL: The 8 syntax errors in compiled SQL are minor (nested WITH in specific Snowflake patterns) and affect only 3% of compiled files.