Skip to content

SQL Server: SQL Data Warehouse Project

Metric Count
Total lineage found by LLM 9
Total lineage found by DFA 9
DFA matches (correct) 9
DFA crashes (lineage missed) 0
DFA silent misses 0
DFA robustness issues 1
Accuracy 100%

Source: SQL Data Warehouse Project (3-layer ETL: Bronze/Silver/Gold) MantisBT: none Date: 2026-04-03


Performance Comparison

Metric LLM DFA
Wall-clock time 33s 11s
Token usage N/A N/A
Agent count 1 agent 1 process
Approach Read & understand SQL text Parse AST & trace dataflow

Notes: - DFA was ~3x faster than LLM for this small codebase (8 SQL files, 929 lines total). - DFA time includes parsing non-SQL files from the git directory, which added unnecessary overhead.


Lineage Comparison Detail

Views (Gold Layer)

Target Source Tables LLM DFA DFA Columns
gold.dim_customers silver.crm_cust_info, silver.erp_cust_az12, silver.erp_loc_a101 Yes Yes 9 fdd
gold.dim_products silver.crm_prd_info, silver.erp_px_cat_g1v2 Yes Yes 10 fdd
gold.fact_sales silver.crm_sales_details, gold.dim_products, gold.dim_customers Yes Yes 9 fdd

Procedure INSERT...SELECT (Silver Layer)

Target Source Table LLM DFA DFA Columns
silver.crm_cust_info bronze.crm_cust_info Yes Yes 7 fdd
silver.crm_prd_info bronze.crm_prd_info Yes Yes 8 fdd
silver.crm_sales_details bronze.crm_sales_details Yes Yes 9 fdd
silver.erp_cust_az12 bronze.erp_cust_az12 Yes Yes 3 fdd
silver.erp_loc_a101 bronze.erp_loc_a101 Yes Yes 2 fdd
silver.erp_px_cat_g1v2 bronze.erp_px_cat_g1v2 Yes Yes 4 fdd

External Data Loads (Bronze Layer)

LLM identified 6 BULK INSERT operations loading CSV files into bronze tables. DFA correctly recognized the bronze.load_bronze procedure but did not create lineage relationships for BULK INSERT since the source is external (CSV files, not database tables). This is correct behavior.

Robustness Issues

Issue 1: Non-SQL files parsed in directory mode

DFA processed all files in the directory tree (including .git/ contents) when using /d flag, generating 18,000+ syntax errors on git configuration files, hook samples, and pack files. This does not affect lineage accuracy but adds noise to output and wastes processing time.

Recommendation: DFA /d mode should filter to only .sql files, or at minimum skip .git/ directories.


Side-by-Side: Where Each Approach Excels

Summary

Dimension Winner Evidence Impact
Column-level precision DFA 61 column-level fdd relationships traced across 9 lineage paths High
Multi-source expressions DFA CASE/TRIM/LEAD window functions traced to source columns Medium
FK/ER relationships Tie No FK constraints in this DDL-only warehouse schema Low
fdd vs fdr distinction DFA All 61 relationships correctly classified as fdd (direct dataflow) Medium
Novel/vendor-specific syntax Tie Standard T-SQL patterns (BULK INSERT, ROW_NUMBER, LEAD, CASE) Low
Semantic understanding LLM Described business logic: dedup strategy, gender normalization, date validation Low
Graceful degradation Tie No crashes on either side; all SQL parsed successfully Low
Aggregate/string functions DFA TRIM traced through function resultsets to source columns Medium
Trigger/procedure bodies Tie Both correctly traced INSERT...SELECT inside stored procedures High

Detail

DFA Advantages in This Codebase

  • Column-level precision: DFA traced 61 individual column-to-column relationships. For example, gold.dim_customers.first_name <- silver.crm_cust_info.cst_firstname via TRIM function. LLM identified table-level lineage but did not enumerate every column mapping.
  • Function tracing: DFA created intermediate resultset nodes for TRIM, NULLIF, and other functions, showing how data flows through transformations at the column level.
  • Procedure detection: DFA identified silver.load_silver as a procedure and correctly attributed all 6 INSERT operations to it, with procedure IDs linking processes to their parent procedure.

LLM Advantages in This Codebase

  • Business logic understanding: LLM described transformation intent -- e.g., "normalize marital status (S->Single, M->Married)", "dedup by most recent cst_create_date per customer", "filter future birthdates to NULL". DFA traces the dataflow but doesn't capture business meaning.
  • Architecture comprehension: LLM identified the 3-layer ETL pattern (Bronze->Silver->Gold), star schema design, and multi-source integration (CRM + ERP) -- structural insights that DFA doesn't produce.

Tie / Both Correct

  • Both approaches correctly identified all 9 lineage relationships with matching source/target tables.
  • Both correctly handled stored procedure bodies with TRUNCATE+INSERT patterns.
  • Both correctly handled CREATE VIEW with multi-table LEFT JOINs and window functions (ROW_NUMBER, LEAD).
  • Neither had issues with the standard T-SQL syntax used in this project.

Recommendations

  1. DFA /d mode should skip .git/ directories -- this would eliminate 18,000+ spurious error messages and improve performance.
  2. BULK INSERT lineage: Consider adding optional support for BULK INSERT source tracking (noting the CSV filename as an external source), useful for end-to-end lineage in ETL pipelines.
  3. No parser fixes needed: All SQL in this project parsed successfully with the MSSQL parser.