Skip to content

Data Lineage Validation

This section documents the accuracy of GSP's DataFlowAnalyzer (DFA) by comparing its output against independent LLM-based lineage analysis across multiple database vendors and real-world SQL codebases.

Both approaches have distinct strengths and limitations. Together they provide a more complete picture than either alone.

Methodology

Each validation report follows a consistent process:

  1. LLM Analysis — An LLM reads the SQL files and identifies all data lineage relationships (views, INSERT...SELECT, CTAS, triggers, stored procedures, MERGE statements)
  2. DFA Analysis — The GSP DataFlowAnalyzer processes the same SQL files programmatically
  3. Comparison — Results are compared and discrepancies categorized:
Category Severity Description
Cat 1: DFA Crash Critical DFA threw an exception, lineage missed
Cat 2: Silent Miss High DFA produced no error but lineage not detected
Cat 3: DFA Extra Info DFA found lineage LLM missed (usually FK relationships)
Cat 4: Robustness Low DFA errors that don't affect lineage accuracy

Reports by Vendor

Vendor Test Suites Overall Accuracy
Oracle 1 96%
SQL Server 1 100%
Snowflake 1 0%
DuckDB 1 0%

DFA Strengths vs LLM Strengths

The two approaches are complementary. Each excels where the other struggles.

Where DFA Outperforms LLM

Capability DFA LLM
Column-level lineage Precise column-to-column mappings (e.g., view.salaryemployees.SALARY) Table-level only; cannot reliably trace individual columns through complex joins
Multi-source computed columns Correctly identifies all contributing columns (e.g., TOTAL_COSTunit_cost + quantity_sold) May miss secondary source columns in expressions
Foreign key relationships Automatically extracted from DDL constraints as ER lineage Often noted as structural info but not counted as lineage
Deterministic and reproducible Same input always produces same output Results vary between runs; may miss or hallucinate relationships
Scale Handles thousands of SQL files in batch Context window limits restrict to smaller file sets
COALESCE / multi-branch expressions Traces all branches to their source columns May oversimplify to the "primary" source
Indirect dataflow (fdr) Distinguishes direct (fdd) vs indirect (fdr) relationships (WHERE, JOIN ON, GROUP BY) Does not differentiate — reports all as "lineage"
Object views and type hierarchies Correctly handles Oracle object views (OF type), UNDER subviews, MAKE_REF May misidentify nested object-relational patterns

Where LLM Outperforms DFA

Capability LLM DFA
Novel SQL syntax Understands any SQL pattern from language knowledge, even vendor-specific extensions not yet in the parser Fails on unsupported syntax (parse error = zero lineage)
Semantic understanding Understands that DECODE(active, 1, 'active', '') means "the output depends on active" Treats DECODE arguments mechanically per function.properties; condition args are fdr by default
Template languages (Jinja/dbt) Can read through {{ ref('model') }} and understand the intended SQL Cannot parse non-SQL template syntax at all
Dynamic SQL Can reason about SQL constructed via string concatenation in procedures Cannot trace lineage through dynamic SQL strings
Cross-file context Understands that a procedure in file A writes to a table read by a view in file B Limited cross-file resolution without external metadata
Intent and business logic Can explain why data flows a certain way, not just that it does Produces structural relationships without semantic context
Graceful degradation Never crashes; always produces some analysis even on malformed SQL Crashes (NPE, OOM, IndexOutOfBounds) on certain SQL patterns
Aggregate function arguments Understands that GROUP_CONCAT(CONCAT(first_name, last_name)) depends on both name columns Does not trace arguments inside GROUP_CONCAT, LISTAGG, STRING_AGG to source columns
Trigger pseudo-records Correctly resolves NEW.column / OLD.column to the triggering table's columns May resolve NEW.col as self-reference to the target table
CREATE RULE / partition routing Understands PostgreSQL rules as data routing from parent to partition tables Partial or no detection of CREATE RULE lineage

DFA Known Limitations

Patterns that consistently challenge the DataFlowAnalyzer across validations:

Resolved with Configuration

Pattern Affected Vendors Resolution
CASE WHEN condition columns not traced All Enable /showCaseWhenAsDirect

Open Issues

Pattern Affected Vendors Root Cause
Oracle DECODE() first argument Oracle, GaussDB Treated as fdr in function.properties; /showCaseWhenAsDirect does not cover it
GROUP_CONCAT / LISTAGG / STRING_AGG MySQL, PostgreSQL, SQL Server Aggregate function arguments not traced to source columns
Trigger NEW pseudo-record resolution MySQL NEW.col resolves to target table (self) instead of source table
JSON_TABLE with NESTED PATH Oracle NullPointerException in column model creation
CAST(MULTISET(SELECT ...)) in VALUES Oracle IndexOutOfBoundsException in INSERT analysis
PL/pgSQL ELSIF, RECORD, FOR..IN EXECUTE Redshift Parser does not support these PL/pgSQL constructs
ARRAY_AGG(DISTINCT ... IGNORE NULLS) BigQuery NullPointerException in aggregate analysis
BigQuery 2-arg TRIM(expr, chars) BigQuery Parser only supports standard TRIM(LEADING/TRAILING FROM ...)
BigQuery scripting (FOR..IN..DO, BEGIN..EXCEPTION) BigQuery Scripting constructs not supported
PostgreSQL CREATE RULE partition routing PostgreSQL Partially detected — column-level lineage incomplete
Function body table reads PostgreSQL SELECT INTO within functions not emitted in simple output mode

Expected Limitations (Not Bugs)

Pattern Reason
Jinja/dbt templates ({{ ref('...') }}) Not SQL — requires template pre-processing
Dynamic SQL (EXECUTE IMMEDIATE, string concatenation) SQL text constructed at runtime, unknowable at parse time
External data loads (SQL*Loader, COPY FROM) Data source is outside SQL; no SQL-level lineage

LLM Known Limitations

Patterns where LLM analysis is unreliable or incomplete:

Structural Limitations

Limitation Impact Example
No column-level precision LLM identifies table-to-table flows but cannot reliably map individual columns through multi-table joins with aliases SELECT e.name, d.dept_name FROM emp e JOIN dept d ON ... — LLM knows both tables contribute but may not map which column goes where
Cannot distinguish fdd vs fdr All relationships reported as "lineage" without differentiating direct dataflow from indirect (WHERE/JOIN conditions) WHERE status = 'A' — LLM reports status as lineage, but it's fdr (indirect), not fdd (direct)
Hallucination risk May infer lineage that doesn't exist, especially with ambiguous column names or complex CTEs A column named id in a CTE may be incorrectly attributed to the wrong source table
Non-deterministic Different runs on the same SQL may produce slightly different results One run may catch a subtle trigger flow, the next may miss it
Context window limits Cannot process very large SQL files (>200K lines) or very many files simultaneously The 231K-line INSERT data files must be skipped entirely

Analytical Limitations

Limitation Impact
Misses FK/ER relationships Does not systematically extract foreign key constraints from DDL as lineage
Over-counts INSERT...VALUES May report literal INSERT statements as "lineage" when there's no source table
Inconsistent trigger analysis May miss INSTEAD OF triggers or misidentify which tables are source vs target
No incremental analysis Cannot efficiently re-analyze only changed files; must re-read everything
Cannot verify correctness LLM lineage cannot be programmatically validated — it's an opinion, not a proof

When to Use Which Approach

Scenario Recommended Approach
Production lineage for data governance DFA — deterministic, column-level, auditable
Quick assessment of unfamiliar SQL codebase LLM — handles any syntax, gives semantic context
Validating DFA accuracy on new SQL patterns Both — use LLM as oracle, DFA as system under test
Lineage through dbt/Jinja templates LLM — DFA cannot parse templates
Lineage for compliance/audit trails DFA — reproducible, machine-readable output
Understanding why data flows a certain way LLM — can explain business logic and intent
Processing 10,000+ SQL files DFA — scales to any size with batch/parallel processing
Debugging a specific query's lineage DFA with /showCaseWhenAsDirect and /env metadata.json — precise column-level detail

Configuration Impact

Some DFA options significantly affect lineage completeness:

Option Impact
/showCaseWhenAsDirect Resolves all CASE WHEN condition column misses
/showConstant Makes literal value flows visible
/env metadata.json Improves orphan column resolution via external schema metadata
/lof Links unresolved columns to first table (fallback heuristic)
/treatArgumentsInCountFunctionAsDirectDataflow Treats aggregate function arguments as direct dataflow

See Configuration Options for full details.