DuckDB: SOMA B2B SaaS (dbt)
| Metric | Count |
|---|---|
| Total lineage found by LLM | 33 |
| Total lineage found by DFA | 0 |
| DFA matches (correct) | 0 |
| DFA crashes (lineage missed) | 0 |
| DFA silent misses | 33 |
| DFA robustness issues | 2 |
| Accuracy | 0% |
Source Repository: SOMA-B2B-SaaS
MantisBT: #4387 (shared with the Snowflake dbt ticket — same root cause)
Date: 2026-04-03
Performance Comparison
| Metric | LLM | DFA |
|---|---|---|
| Wall-clock time | 79s | 4s |
| Token usage | N/A | N/A |
| Agent count | 2 parallel agents | 1 process |
| Approach | Read & understand SQL + Jinja | Parse AST & trace dataflow |
Project Overview
SOMA (SaaS Operational Metrics & Analytics) is a demo/sample dbt project targeting DuckDB. It models the complete B2B SaaS customer lifecycle using an event-stream architecture:
- 115 model files (all use Jinja templates)
- 82 activity models — generate synthetic data via the `generate_fake_data()` macro (no upstream dependencies)
- 33 models with actual lineage — reference other models via `ref()`
- Architecture: activity streams -> growth accounting metrics -> consolidated cube
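To make the split concrete, the two model flavors can be sketched roughly like this (hypothetical file names and columns, not taken verbatim from the repo):

```sql
-- Activity model (82 of these): the body is a single macro call that
-- synthesizes data, so there is no upstream dependency to trace.
-- models/client_stream_signed_up.sql (hypothetical)
{{ generate_fake_data() }}

-- Lineage model (33 of these): depends on other models through ref(),
-- which dbt resolves to concrete relation names at compile time.
-- models/ga_cube_new_revenue.sql (hypothetical)
select
    date_trunc('month', s.ts) as activity_month,
    sum(s.revenue_impact)     as new_revenue
from {{ ref('client_activity_stream') }} s
join {{ ref('customer') }} c on c.customer_id = s.customer_id
group by 1
```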
Root Cause: dbt/Jinja + DuckDB Incompatibility
Same root cause as Snowflake/mattermost-data-warehouse:
- Raw dbt SQL (with Jinja): DFA cannot tokenize `{{`, `}}`, `{%`, `%}` — all 115 files fail to parse
- Compiled dbt SQL (no Jinja): DFA found 14 table structures but 0 lineage relationships — the compiled files are bare SELECT statements without CREATE VIEW wrappers
- DuckDB-specific syntax: compiled SQL uses DuckDB functions (`generate_series`, `to_seconds`, `unnest`) that the PostgreSQL parser doesn't support — 18 syntax errors on compiled files
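As an illustration of the first two failure modes, here is the same model before and after dbt compilation (a hypothetical model, not copied from the repo):

```sql
-- Raw model: {{ ... }} and {% ... %} are not valid SQL, so tokenization
-- fails before parsing even begins.
select customer_id, sum(revenue_impact) as new_revenue
from {{ ref('client_activity_stream') }}
group by customer_id

-- Compiled output: valid SQL, but a bare SELECT. With no surrounding
-- CREATE TABLE/VIEW statement, the parser sees a query that reads from
-- a table yet defines nothing, so no lineage edge can be recorded.
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id
```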
All 33 LLM-Detected Lineage Relationships (DFA: 0 of 33)
Base Metrics (12 models) — all ref client_activity_stream + customer
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 1 | ga_cube_new_revenue | client_activity_stream, customer | table |
| 2 | ga_cube_new_subscriptions | client_activity_stream, customer | table |
| 3 | ga_cube_expansion_revenue | client_activity_stream, customer | table |
| 4 | ga_cube_expansion_subscriptions | client_activity_stream, customer | table |
| 5 | ga_cube_contraction_revenue | client_activity_stream, customer | table |
| 6 | ga_cube_contraction_subscriptions | client_activity_stream, customer | table |
| 7 | ga_cube_churned_revenue | client_activity_stream, customer | table |
| 8 | ga_cube_churned_subscriptions | client_activity_stream, customer | table |
| 9 | ga_cube_resurrected_revenue | client_activity_stream, customer | table |
| 10 | ga_cube_resurrected_subscriptions | client_activity_stream, customer | table |
| 11 | ga_cube_retained_revenue | client_activity_stream, customer | table |
| 12 | ga_cube_retained_subscriptions | client_activity_stream, customer | table |
Composite Metrics (19 models)
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 13 | ga_cube_total_revenue | ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 14 | ga_cube_total_subscriptions | ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 15 | ga_cube_net_revenue | ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 16 | ga_cube_net_subscriptions | ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 17 | ga_cube_committed_revenue | ga_cube_total_revenue, ga_cube_net_revenue | table |
| 18 | ga_cube_revenue_growth_rate | ga_cube_total_revenue | table |
| 19 | ga_cube_subscriber_growth_rate | ga_cube_total_subscriptions | table |
| 20 | ga_cube_revenue_cmgr | ga_cube_total_revenue | table |
| 21 | ga_cube_subscriber_cmgr | ga_cube_total_subscriptions | table |
| 22 | ga_cube_gross_revenue_churn | ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 23 | ga_cube_gross_revenue_churn_rate | ga_cube_contraction_revenue, ga_cube_churned_revenue, ga_cube_total_revenue | table |
| 24 | ga_cube_gross_subscriptions_churn | ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 25 | ga_cube_net_revenue_churn | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue | table |
| 26 | ga_cube_net_revenue_churn_rate | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_total_revenue | table |
| 27 | ga_cube_gross_dollar_retention | ga_cube_gross_revenue_churn_rate | table |
| 28 | ga_cube_net_dollar_retention | ga_cube_net_revenue_churn_rate | table |
| 29 | quick_ratio_cube | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue | table |
| 30 | ga_cube_quick_ratio | (same as quick_ratio_cube) | table |
Aggregation Models (3 models)
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 31 | client_activity_stream | 10 client_stream_* activity tables | table |
| 32 | growth_accounting_cube | all 30 ga_cube_* models | view |
Note: client_activity_stream references raw activity tables directly (not via ref()), using a Jinja-generated UNION ALL with a post-hook for window function updates.
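A rough sketch of that pattern (hypothetical table list and post-hook; the real model differs):

```sql
-- models/client_activity_stream.sql (hypothetical sketch)
-- The Jinja loop emits one SELECT per raw activity table, stitched
-- together with UNION ALL. The tables are interpolated as plain strings
-- rather than ref() calls, so only a semantic reader sees the dependency.
-- (The real post-hook applies ROW_NUMBER/LEAD window-function updates.)
{{ config(materialized='table',
          post_hook='update {{ this }} set activity_occurrence = 1') }}

{% set activity_tables = ['client_stream_signed_up',
                          'client_stream_started_subscription'] %}
{% for t in activity_tables %}
select * from {{ t }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```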
Robustness Issues
Issue 1: DuckDB syntax not supported by PostgreSQL parser
DFA used the PostgreSQL parser (the closest available match for DuckDB). The compiled SQL contains DuckDB-specific syntax:
- `generate_series(1, 5000)` as a table source
- the `to_seconds()` function
- `unnest()` as a table function
- `floor(random() * id * 10000)::int` cast syntax
This caused 18 syntax errors across compiled files.
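Collected into one synthetic snippet, the flagged constructs look like this (valid DuckDB; each reportedly fails under the PostgreSQL grammar as used here):

```sql
-- generate_series(...) as a table source, with its default column name
select
    gs.generate_series                                as id,
    to_seconds(gs.generate_series)                    as elapsed,  -- int -> INTERVAL
    floor(random() * gs.generate_series * 10000)::int as fake_key
from generate_series(1, 5000) gs;

-- unnest(...) used as a table function over a DuckDB list literal
select unnest(['signed_up', 'churned']) as activity;
```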
Issue 2: Jinja template tokenization failure
Same as the Snowflake report — all 115 raw model files fail at the tokenizer level due to `{{`, `}}`, `{%`, `%}` characters.
Side-by-Side: Where Each Approach Excels
Summary
| Dimension | Winner | Evidence | Impact |
|---|---|---|---|
| Column-level precision | N/A | DFA found 0 relationships | High |
| Multi-source expressions | N/A | No DFA output | N/A |
| FK/ER relationships | N/A | No FK constraints in this project | N/A |
| fdd vs fdr distinction | N/A | DFA found 0 relationships | N/A |
| Novel/vendor-specific syntax | LLM | Jinja macros, DuckDB functions, generate_fake_data() | Critical |
| Semantic understanding | LLM | Understood generate_fake_data() produces synthetic data | Critical |
| Graceful degradation | LLM | DFA: 0 output; LLM: all 33 relationships found | Critical |
| Aggregate/string functions | N/A | No DFA output | N/A |
| Trigger/procedure bodies | LLM | Understood post_hook UPDATE with window functions | Medium |
DFA Advantages
None for this codebase. DFA cannot process dbt/Jinja SQL or the DuckDB dialect.
LLM Advantages
- Jinja macro comprehension: understood `generate_fake_data()` as a data-generating macro with no upstream dependency
- `ref()` dependency graph: correctly traced 33 model-to-model dependencies through Jinja `ref()` calls
- Architecture understanding: identified the layered pattern: activity streams -> base metrics -> composite metrics -> consolidated cube
- DuckDB awareness: correctly identified DuckDB-specific syntax (`generate_series`, `to_seconds`, `unnest`)
- Post-hook understanding: identified the UPDATE with ROW_NUMBER/LEAD window functions in client_activity_stream's post_hook as part of the data transformation
Recommendations
- DuckDB dialect support: add DuckDB as a supported dialect (based on PostgreSQL, with extensions for `generate_series` as a table source, `unnest()`, `to_seconds()`, etc.)
- dbt pre-processor: same recommendation as the Snowflake report — wrap compiled SQL in `CREATE TABLE`/`CREATE VIEW` statements using `manifest.json` metadata
- Shared MantisBT ticket: this report shares #4387 with the Snowflake dbt finding — same root cause (Jinja incompatibility)
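The pre-processing step is essentially a textual wrap; a before/after sketch (hypothetical relation names, with the materialization and target name taken from the node's `manifest.json` entry):

```sql
-- target/compiled/...: what dbt emits today (a bare SELECT)
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id;

-- After wrapping: the manifest records this node as a table named
-- ga_cube_new_revenue, so the pre-processor prepends a CREATE
-- statement and the parser can now emit a lineage edge for it.
create table "soma"."main"."ga_cube_new_revenue" as
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id;
```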