Skip to content

DuckDB: SOMA B2B SaaS (dbt)

Metric Count
Total lineage found by LLM 33
Total lineage found by DFA 0
DFA matches (correct) 0
DFA crashes (lineage missed) 0
DFA silent misses 33
DFA robustness issues 2
Accuracy 0%

Source Repository: SOMA-B2B-SaaS MantisBT: #4387 (shared with Snowflake dbt ticket — same root cause) Date: 2026-04-03


Performance Comparison

Metric LLM DFA
Wall-clock time 79s 4s
Token usage N/A N/A
Agent count 2 parallel agents 1 process
Approach Read & understand SQL + Jinja Parse AST & trace dataflow

Project Overview

SOMA (SaaS Operational Metrics & Analytics) is a demo/sample dbt project targeting DuckDB. It models the complete B2B SaaS customer lifecycle using an event-stream architecture:

  • 115 model files (all use Jinja templates)
  • 82 activity models — generate synthetic data via generate_fake_data() macro (no upstream dependencies)
  • 33 models with actual lineage — reference other models via ref()
  • Architecture: Activity streams -> Growth accounting metrics -> Consolidated cube

Root Cause: dbt/Jinja + DuckDB Incompatibility

Same root cause as Snowflake/mattermost-data-warehouse:

  1. Raw dbt SQL (with Jinja): DFA cannot tokenize {{, }}, {%, %} — all 115 files fail to parse
  2. Compiled dbt SQL (no Jinja): DFA found 14 table structures but 0 lineage relationships — bare SELECT without CREATE VIEW wrappers
  3. DuckDB-specific syntax: Compiled SQL uses DuckDB functions (generate_series, to_seconds, unnest) that the PostgreSQL parser doesn't support — 18 syntax errors on compiled files

All 33 LLM-Detected Lineage Relationships (DFA: 0 of 33)

Base Metrics (12 models) — all ref client_activity_stream + customer

# Target Model Source Models Type
1 ga_cube_new_revenue client_activity_stream, customer table
2 ga_cube_new_subscriptions client_activity_stream, customer table
3 ga_cube_expansion_revenue client_activity_stream, customer table
4 ga_cube_expansion_subscriptions client_activity_stream, customer table
5 ga_cube_contraction_revenue client_activity_stream, customer table
6 ga_cube_contraction_subscriptions client_activity_stream, customer table
7 ga_cube_churned_revenue client_activity_stream, customer table
8 ga_cube_churned_subscriptions client_activity_stream, customer table
9 ga_cube_resurrected_revenue client_activity_stream, customer table
10 ga_cube_resurrected_subscriptions client_activity_stream, customer table
11 ga_cube_retained_revenue client_activity_stream, customer table
12 ga_cube_retained_subscriptions client_activity_stream, customer table

Composite Metrics (19 models)

# Target Model Source Models Type
13 ga_cube_total_revenue ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue table
14 ga_cube_total_subscriptions ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions table
15 ga_cube_net_revenue ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue table
16 ga_cube_net_subscriptions ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions table
17 ga_cube_committed_revenue ga_cube_total_revenue, ga_cube_net_revenue table
18 ga_cube_revenue_growth_rate ga_cube_total_revenue table
19 ga_cube_subscriber_growth_rate ga_cube_total_subscriptions table
20 ga_cube_revenue_cmgr ga_cube_total_revenue table
21 ga_cube_subscriber_cmgr ga_cube_total_subscriptions table
22 ga_cube_gross_revenue_churn ga_cube_contraction_revenue, ga_cube_churned_revenue table
23 ga_cube_gross_revenue_churn_rate ga_cube_contraction_revenue, ga_cube_churned_revenue, ga_cube_total_revenue table
24 ga_cube_gross_subscriptions_churn ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions table
25 ga_cube_net_revenue_churn ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue table
26 ga_cube_net_revenue_churn_rate ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_total_revenue table
27 ga_cube_gross_dollar_retention ga_cube_gross_revenue_churn_rate table
28 ga_cube_net_dollar_retention ga_cube_net_revenue_churn_rate table
29 quick_ratio_cube ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue table
30 ga_cube_quick_ratio (same as quick_ratio_cube) table

Aggregation Models (3 models)

# Target Model Source Models Type
31 client_activity_stream 10 client_stream_* activity tables table
32 growth_accounting_cube all 30 ga_cube_* models view

Note: client_activity_stream references raw activity tables directly (not via ref()), using a Jinja-generated UNION ALL with a post-hook for window function updates.


Robustness Issues

Issue 1: DuckDB syntax not supported by PostgreSQL parser

DFA used PostgreSQL parser (closest match for DuckDB). Compiled SQL contains DuckDB-specific syntax: - generate_series(1,5000) as table source - to_seconds() function - unnest() as table function - floor(random()*id*10000)::int cast syntax

This caused 18 syntax errors across compiled files.

Issue 2: Jinja template tokenization failure

Same as Snowflake report — all 115 raw model files fail at the tokenizer level due to {{, }}, {%, %} characters.


Side-by-Side: Where Each Approach Excels

Summary

Dimension Winner Evidence Impact
Column-level precision N/A DFA found 0 relationships High
Multi-source expressions N/A No DFA output N/A
FK/ER relationships N/A No FK constraints in this project N/A
fdd vs fdr distinction N/A DFA found 0 relationships N/A
Novel/vendor-specific syntax LLM Jinja macros, DuckDB functions, generate_fake_data() Critical
Semantic understanding LLM Understood generate_fake_data() produces synthetic data Critical
Graceful degradation LLM DFA: 0 output; LLM: all 33 relationships found Critical
Aggregate/string functions N/A No DFA output N/A
Trigger/procedure bodies LLM Understood post_hook UPDATE with window functions Medium

DFA Advantages

None for this codebase. DFA cannot process dbt/Jinja SQL or DuckDB dialect.

LLM Advantages

  • Jinja macro comprehension: Understood generate_fake_data() as a data-generating macro with no upstream dependency
  • ref() dependency graph: Correctly traced 33 model-to-model dependencies through Jinja ref() calls
  • Architecture understanding: Identified the 3-layer pattern: activity streams -> base metrics -> composite metrics -> consolidated cube
  • DuckDB awareness: Correctly identified DuckDB-specific syntax (generate_series, to_seconds, unnest)
  • Post-hook understanding: Identified the UPDATE with ROW_NUMBER/LEAD window functions in client_activity_stream's post_hook as part of the data transformation

Recommendations

  1. DuckDB dialect support: Add DuckDB as a supported dialect (based on PostgreSQL with extensions for generate_series as table source, unnest(), to_seconds(), etc.)
  2. dbt pre-processor: Same recommendation as Snowflake report — wrap compiled SQL in CREATE TABLE/VIEW statements using manifest.json metadata
  3. Shared MantisBT ticket: This report shares #4387 with the Snowflake dbt finding — same root cause (Jinja incompatibility)