SQL Server: SQL Data Warehouse Project¶
| Metric | Count |
|---|---|
| Total lineage found by LLM | 9 |
| Total lineage found by DFA | 9 |
| DFA matches (correct) | 9 |
| DFA crashes (lineage missed) | 0 |
| DFA silent misses | 0 |
| DFA robustness issues | 1 |
| Accuracy | 100% |
Source: SQL Data Warehouse Project (3-layer ETL: Bronze/Silver/Gold) MantisBT: none Date: 2026-04-03
Performance Comparison¶
| Metric | LLM | DFA |
|---|---|---|
| Wall-clock time | 33s | 11s |
| Token usage | N/A | N/A |
| Agent count | 1 agent | 1 process |
| Approach | Read & understand SQL text | Parse AST & trace dataflow |
Notes: - DFA was ~3x faster than LLM for this small codebase (8 SQL files, 929 lines total). - DFA time includes parsing non-SQL files from the git directory, which added unnecessary overhead.
Lineage Comparison Detail¶
Views (Gold Layer)¶
| Target | Source Tables | LLM | DFA | DFA Columns |
|---|---|---|---|---|
| gold.dim_customers | silver.crm_cust_info, silver.erp_cust_az12, silver.erp_loc_a101 | Yes | Yes | 9 fdd |
| gold.dim_products | silver.crm_prd_info, silver.erp_px_cat_g1v2 | Yes | Yes | 10 fdd |
| gold.fact_sales | silver.crm_sales_details, gold.dim_products, gold.dim_customers | Yes | Yes | 9 fdd |
Procedure INSERT...SELECT (Silver Layer)¶
| Target | Source Table | LLM | DFA | DFA Columns |
|---|---|---|---|---|
| silver.crm_cust_info | bronze.crm_cust_info | Yes | Yes | 7 fdd |
| silver.crm_prd_info | bronze.crm_prd_info | Yes | Yes | 8 fdd |
| silver.crm_sales_details | bronze.crm_sales_details | Yes | Yes | 9 fdd |
| silver.erp_cust_az12 | bronze.erp_cust_az12 | Yes | Yes | 3 fdd |
| silver.erp_loc_a101 | bronze.erp_loc_a101 | Yes | Yes | 2 fdd |
| silver.erp_px_cat_g1v2 | bronze.erp_px_cat_g1v2 | Yes | Yes | 4 fdd |
External Data Loads (Bronze Layer)¶
LLM identified 6 BULK INSERT operations loading CSV files into bronze tables. DFA correctly recognized the bronze.load_bronze procedure but did not create lineage relationships for BULK INSERT since the source is external (CSV files, not database tables). This is correct behavior.
Robustness Issues¶
Issue 1: Non-SQL files parsed in directory mode¶
DFA processed all files in the directory tree (including .git/ contents) when using /d flag, generating 18,000+ syntax errors on git configuration files, hook samples, and pack files. This does not affect lineage accuracy but adds noise to output and wastes processing time.
Recommendation: DFA /d mode should filter to only .sql files, or at minimum skip .git/ directories.
Side-by-Side: Where Each Approach Excels¶
Summary¶
| Dimension | Winner | Evidence | Impact |
|---|---|---|---|
| Column-level precision | DFA | 61 column-level fdd relationships traced across 9 lineage paths | High |
| Multi-source expressions | DFA | CASE/TRIM/LEAD window functions traced to source columns | Medium |
| FK/ER relationships | Tie | No FK constraints in this DDL-only warehouse schema | Low |
| fdd vs fdr distinction | DFA | All 61 relationships correctly classified as fdd (direct dataflow) | Medium |
| Novel/vendor-specific syntax | Tie | Standard T-SQL patterns (BULK INSERT, ROW_NUMBER, LEAD, CASE) | Low |
| Semantic understanding | LLM | Described business logic: dedup strategy, gender normalization, date validation | Low |
| Graceful degradation | Tie | No crashes on either side; all SQL parsed successfully | Low |
| Aggregate/string functions | DFA | TRIM traced through function resultsets to source columns | Medium |
| Trigger/procedure bodies | Tie | Both correctly traced INSERT...SELECT inside stored procedures | High |
Detail¶
DFA Advantages in This Codebase¶
- Column-level precision: DFA traced 61 individual column-to-column relationships. For example,
gold.dim_customers.first_name<-silver.crm_cust_info.cst_firstnamevia TRIM function. LLM identified table-level lineage but did not enumerate every column mapping. - Function tracing: DFA created intermediate
resultsetnodes for TRIM, NULLIF, and other functions, showing how data flows through transformations at the column level. - Procedure detection: DFA identified
silver.load_silveras a procedure and correctly attributed all 6 INSERT operations to it, with procedure IDs linking processes to their parent procedure.
LLM Advantages in This Codebase¶
- Business logic understanding: LLM described transformation intent -- e.g., "normalize marital status (S->Single, M->Married)", "dedup by most recent cst_create_date per customer", "filter future birthdates to NULL". DFA traces the dataflow but doesn't capture business meaning.
- Architecture comprehension: LLM identified the 3-layer ETL pattern (Bronze->Silver->Gold), star schema design, and multi-source integration (CRM + ERP) -- structural insights that DFA doesn't produce.
Tie / Both Correct¶
- Both approaches correctly identified all 9 lineage relationships with matching source/target tables.
- Both correctly handled stored procedure bodies with TRUNCATE+INSERT patterns.
- Both correctly handled CREATE VIEW with multi-table LEFT JOINs and window functions (ROW_NUMBER, LEAD).
- Neither had issues with the standard T-SQL syntax used in this project.
Recommendations¶
- DFA
/dmode should skip.git/directories -- this would eliminate 18,000+ spurious error messages and improve performance. - BULK INSERT lineage: Consider adding optional support for BULK INSERT source tracking (noting the CSV filename as an external source), useful for end-to-end lineage in ETL pipelines.
- No parser fixes needed: All SQL in this project parsed successfully with the MSSQL parser.