DuckDB: SOMA B2B SaaS (dbt)
| Metric | Count |
|---|---|
| Total lineage found by LLM | 33 |
| Total lineage found by DFA | 0 |
| DFA matches (correct) | 0 |
| DFA crashes (lineage missed) | 0 |
| DFA silent misses | 33 |
| DFA robustness issues | 2 |
| Accuracy | 0% |
Source Repository: SOMA-B2B-SaaS
MantisBT: #4387 (shared with the Snowflake dbt ticket — same root cause)
Date: 2026-04-03
Performance Comparison
| Metric | LLM | DFA |
|---|---|---|
| Wall-clock time | 79s | 4s |
| Token usage | N/A | N/A |
| Agent count | 2 parallel agents | 1 process |
| Approach | Read & understand SQL + Jinja | Parse AST & trace dataflow |
Project Overview
SOMA (SaaS Operational Metrics & Analytics) is a demo/sample dbt project targeting DuckDB. It models the complete B2B SaaS customer lifecycle using an event-stream architecture:
- 115 model files (all use Jinja templates)
- 82 activity models — generate synthetic data via the `generate_fake_data()` macro (no upstream dependencies)
- 33 models with actual lineage — reference other models via `ref()`
- Architecture: activity streams -> growth accounting metrics -> consolidated cube
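To make the split concrete, the two model flavors can be sketched roughly like this (hypothetical file names and columns, not taken verbatim from the repo):

```sql
-- Activity model (82 of these): the body is a single macro call that
-- synthesizes data, so there is no upstream dependency to trace.
-- models/client_stream_signed_up.sql (hypothetical)
{{ generate_fake_data() }}

-- Lineage model (33 of these): depends on other models through ref(),
-- which dbt resolves to concrete relation names at compile time.
-- models/ga_cube_new_revenue.sql (hypothetical)
select
    date_trunc('month', s.ts) as activity_month,
    sum(s.revenue_impact)     as new_revenue
from {{ ref('client_activity_stream') }} s
join {{ ref('customer') }} c on c.customer_id = s.customer_id
group by 1
```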
Root Cause: dbt/Jinja + DuckDB Incompatibility
Same root cause as Snowflake/mattermost-data-warehouse:
- Raw dbt SQL (with Jinja): DFA cannot tokenize `{{`, `}}`, `{%`, `%}` — all 115 files fail to parse
- Compiled dbt SQL (no Jinja): DFA found 14 table structures but 0 lineage relationships — the compiled files are bare SELECT statements without CREATE VIEW wrappers
- DuckDB-specific syntax: compiled SQL uses DuckDB functions (`generate_series`, `to_seconds`, `unnest`) that the PostgreSQL parser doesn't support — 18 syntax errors on compiled files
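As an illustration of the first two failure modes, here is the same model before and after dbt compilation (a hypothetical model, not copied from the repo):

```sql
-- Raw model: {{ ... }} and {% ... %} are not valid SQL, so tokenization
-- fails before parsing even begins.
select customer_id, sum(revenue_impact) as new_revenue
from {{ ref('client_activity_stream') }}
group by customer_id

-- Compiled output: valid SQL, but a bare SELECT. With no surrounding
-- CREATE TABLE/VIEW statement, the parser sees a query that reads from
-- a table yet defines nothing, so no lineage edge can be recorded.
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id
```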
All 33 LLM-Detected Lineage Relationships (DFA: 0 of 33)
Base Metrics (12 models) — all ref client_activity_stream + customer
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 1 | ga_cube_new_revenue | client_activity_stream, customer | table |
| 2 | ga_cube_new_subscriptions | client_activity_stream, customer | table |
| 3 | ga_cube_expansion_revenue | client_activity_stream, customer | table |
| 4 | ga_cube_expansion_subscriptions | client_activity_stream, customer | table |
| 5 | ga_cube_contraction_revenue | client_activity_stream, customer | table |
| 6 | ga_cube_contraction_subscriptions | client_activity_stream, customer | table |
| 7 | ga_cube_churned_revenue | client_activity_stream, customer | table |
| 8 | ga_cube_churned_subscriptions | client_activity_stream, customer | table |
| 9 | ga_cube_resurrected_revenue | client_activity_stream, customer | table |
| 10 | ga_cube_resurrected_subscriptions | client_activity_stream, customer | table |
| 11 | ga_cube_retained_revenue | client_activity_stream, customer | table |
| 12 | ga_cube_retained_subscriptions | client_activity_stream, customer | table |
Composite Metrics (19 models)
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 13 | ga_cube_total_revenue | ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 14 | ga_cube_total_subscriptions | ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 15 | ga_cube_net_revenue | ga_cube_new_revenue, ga_cube_expansion_revenue, ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 16 | ga_cube_net_subscriptions | ga_cube_new_subscriptions, ga_cube_expansion_subscriptions, ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 17 | ga_cube_committed_revenue | ga_cube_total_revenue, ga_cube_net_revenue | table |
| 18 | ga_cube_revenue_growth_rate | ga_cube_total_revenue | table |
| 19 | ga_cube_subscriber_growth_rate | ga_cube_total_subscriptions | table |
| 20 | ga_cube_revenue_cmgr | ga_cube_total_revenue | table |
| 21 | ga_cube_subscriber_cmgr | ga_cube_total_subscriptions | table |
| 22 | ga_cube_gross_revenue_churn | ga_cube_contraction_revenue, ga_cube_churned_revenue | table |
| 23 | ga_cube_gross_revenue_churn_rate | ga_cube_contraction_revenue, ga_cube_churned_revenue, ga_cube_total_revenue | table |
| 24 | ga_cube_gross_subscriptions_churn | ga_cube_contraction_subscriptions, ga_cube_churned_subscriptions | table |
| 25 | ga_cube_net_revenue_churn | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue | table |
| 26 | ga_cube_net_revenue_churn_rate | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_total_revenue | table |
| 27 | ga_cube_gross_dollar_retention | ga_cube_gross_revenue_churn_rate | table |
| 28 | ga_cube_net_dollar_retention | ga_cube_net_revenue_churn_rate | table |
| 29 | quick_ratio_cube | ga_cube_expansion_revenue, ga_cube_churned_revenue, ga_cube_contraction_revenue, ga_cube_new_revenue | table |
| 30 | ga_cube_quick_ratio | (same as quick_ratio_cube) | table |
Aggregation Models (3 models)
| # | Target Model | Source Models | Type |
|---|---|---|---|
| 31 | client_activity_stream | 10 client_stream_* activity tables | table |
| 32 | growth_accounting_cube | all 30 ga_cube_* models | view |
Note: client_activity_stream references raw activity tables directly (not via ref()), using a Jinja-generated UNION ALL with a post-hook for window function updates.
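A rough sketch of that pattern (hypothetical table list and post-hook; the real model differs):

```sql
-- models/client_activity_stream.sql (hypothetical sketch)
-- The Jinja loop emits one SELECT per raw activity table, stitched
-- together with UNION ALL. The tables are interpolated as plain strings
-- rather than ref() calls, so only a semantic reader sees the dependency.
-- (The real post-hook applies ROW_NUMBER/LEAD window-function updates.)
{{ config(materialized='table',
          post_hook='update {{ this }} set activity_occurrence = 1') }}

{% set activity_tables = ['client_stream_signed_up',
                          'client_stream_started_subscription'] %}
{% for t in activity_tables %}
select * from {{ t }}
{% if not loop.last %}union all{% endif %}
{% endfor %}
```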
Robustness Issues
Issue 1: DuckDB syntax not supported by PostgreSQL parser
DFA used the PostgreSQL parser (the closest available match for DuckDB). The compiled SQL contains DuckDB-specific syntax:
- `generate_series(1, 5000)` as a table source
- the `to_seconds()` function
- `unnest()` as a table function
- `floor(random() * id * 10000)::int` cast syntax
This caused 18 syntax errors across compiled files.
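Collected into one synthetic snippet, the flagged constructs look like this (valid DuckDB; each reportedly fails under the PostgreSQL grammar as used here):

```sql
-- generate_series(...) as a table source, with its default column name
select
    gs.generate_series                                as id,
    to_seconds(gs.generate_series)                    as elapsed,  -- int -> INTERVAL
    floor(random() * gs.generate_series * 10000)::int as fake_key
from generate_series(1, 5000) gs;

-- unnest(...) used as a table function over a DuckDB list literal
select unnest(['signed_up', 'churned']) as activity;
```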
Issue 2: Jinja template tokenization failure
Same as the Snowflake report — all 115 raw model files fail at the tokenizer level due to `{{`, `}}`, `{%`, `%}` characters.
Side-by-Side: Where Each Approach Excels
Summary
| Dimension | Winner | Evidence | Impact |
|---|---|---|---|
| Column-level precision | N/A | DFA found 0 relationships | High |
| Multi-source expressions | N/A | No DFA output | N/A |
| FK/ER relationships | N/A | No FK constraints in this project | N/A |
| fdd vs fdr distinction | N/A | DFA found 0 relationships | N/A |
| Novel/vendor-specific syntax | LLM | Jinja macros, DuckDB functions, generate_fake_data() | Critical |
| Semantic understanding | LLM | Understood generate_fake_data() produces synthetic data | Critical |
| Graceful degradation | LLM | DFA: 0 output; LLM: all 33 relationships found | Critical |
| Aggregate/string functions | N/A | No DFA output | N/A |
| Trigger/procedure bodies | LLM | Understood post_hook UPDATE with window functions | Medium |
DFA Advantages
None for this codebase. DFA cannot process dbt/Jinja SQL or the DuckDB dialect.
LLM Advantages
- Jinja macro comprehension: understood `generate_fake_data()` as a data-generating macro with no upstream dependency
- `ref()` dependency graph: correctly traced 33 model-to-model dependencies through Jinja `ref()` calls
- Architecture understanding: identified the layered pattern: activity streams -> base metrics -> composite metrics -> consolidated cube
- DuckDB awareness: correctly identified DuckDB-specific syntax (`generate_series`, `to_seconds`, `unnest`)
- Post-hook understanding: identified the UPDATE with ROW_NUMBER/LEAD window functions in client_activity_stream's post_hook as part of the data transformation
Recommendations
- DuckDB dialect support: add DuckDB as a supported dialect (based on PostgreSQL, with extensions for `generate_series` as a table source, `unnest()`, `to_seconds()`, etc.)
- dbt pre-processor: same recommendation as the Snowflake report — wrap compiled SQL in `CREATE TABLE`/`CREATE VIEW` statements using `manifest.json` metadata
- Shared MantisBT ticket: this report shares #4387 with the Snowflake dbt finding — same root cause (Jinja incompatibility)
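The pre-processing step is essentially a textual wrap; a before/after sketch (hypothetical relation names, with the materialization and target name taken from the node's `manifest.json` entry):

```sql
-- target/compiled/...: what dbt emits today (a bare SELECT)
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id;

-- After wrapping: the manifest records this node as a table named
-- ga_cube_new_revenue, so the pre-processor prepends a CREATE
-- statement and the parser can now emit a lineage edge for it.
create table "soma"."main"."ga_cube_new_revenue" as
select customer_id, sum(revenue_impact) as new_revenue
from "soma"."main"."client_activity_stream"
group by customer_id;
```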