Data Lineage Validation¶

This section documents the accuracy of GSP's DataFlowAnalyzer (DFA) by comparing its output against independent LLM-based lineage analysis across multiple database vendors and real-world SQL codebases.

Both approaches have distinct strengths and limitations. Together they provide a more complete picture than either alone.

Methodology¶

Each validation report follows a consistent process:

LLM Analysis — An LLM reads the SQL files and identifies all data lineage relationships (views, INSERT...SELECT, CTAS, triggers, stored procedures, MERGE statements)
DFA Analysis — The GSP DataFlowAnalyzer processes the same SQL files programmatically
Comparison — Results are compared and discrepancies categorized:

Category	Severity	Description
Cat 1: DFA Crash	Critical	DFA threw an exception, lineage missed
Cat 2: Silent Miss	High	DFA produced no error but lineage not detected
Cat 3: DFA Extra	Info	DFA found lineage LLM missed (usually FK relationships)
Cat 4: Robustness	Low	DFA errors that don't affect lineage accuracy

Reports by Vendor¶

Vendor	Test Suites	Overall Accuracy
Oracle	1	96%
SQL Server	1	100%
Snowflake	1	0%
DuckDB	1	0%

DFA Strengths vs LLM Strengths¶

The two approaches are complementary. Each excels where the other struggles.

Where DFA Outperforms LLM¶

Capability	DFA	LLM
Column-level lineage	Precise column-to-column mappings (e.g., `view.salary` ← `employees.SALARY`)	Table-level only; cannot reliably trace individual columns through complex joins
Multi-source computed columns	Correctly identifies all contributing columns (e.g., `TOTAL_COST` ← `unit_cost` + `quantity_sold`)	May miss secondary source columns in expressions
Foreign key relationships	Automatically extracted from DDL constraints as ER lineage	Often noted as structural info but not counted as lineage
Deterministic and reproducible	Same input always produces same output	Results vary between runs; may miss or hallucinate relationships
Scale	Handles thousands of SQL files in batch	Context window limits restrict to smaller file sets
COALESCE / multi-branch expressions	Traces all branches to their source columns	May oversimplify to the "primary" source
Indirect dataflow (fdr)	Distinguishes direct (fdd) vs indirect (fdr) relationships (WHERE, JOIN ON, GROUP BY)	Does not differentiate — reports all as "lineage"
Object views and type hierarchies	Correctly handles Oracle object views (`OF type`), `UNDER` subviews, `MAKE_REF`	May misidentify nested object-relational patterns

Where LLM Outperforms DFA¶

Capability	LLM	DFA
Novel SQL syntax	Understands any SQL pattern from language knowledge, even vendor-specific extensions not yet in the parser	Fails on unsupported syntax (parse error = zero lineage)
Semantic understanding	Understands that `DECODE(active, 1, 'active', '')` means "the output depends on `active`"	Treats DECODE arguments mechanically per function.properties; condition args are `fdr` by default
Template languages (Jinja/dbt)	Can read through `{{ ref('model') }}` and understand the intended SQL	Cannot parse non-SQL template syntax at all
Dynamic SQL	Can reason about SQL constructed via string concatenation in procedures	Cannot trace lineage through dynamic SQL strings
Cross-file context	Understands that a procedure in file A writes to a table read by a view in file B	Limited cross-file resolution without external metadata
Intent and business logic	Can explain why data flows a certain way, not just that it does	Produces structural relationships without semantic context
Graceful degradation	Never crashes; always produces some analysis even on malformed SQL	Crashes (NPE, OOM, IndexOutOfBounds) on certain SQL patterns
Aggregate function arguments	Understands that `GROUP_CONCAT(CONCAT(first_name, last_name))` depends on both name columns	Does not trace arguments inside `GROUP_CONCAT`, `LISTAGG`, `STRING_AGG` to source columns
Trigger pseudo-records	Correctly resolves `NEW.column` / `OLD.column` to the triggering table's columns	May resolve `NEW.col` as self-reference to the target table
CREATE RULE / partition routing	Understands PostgreSQL rules as data routing from parent to partition tables	Partial or no detection of `CREATE RULE` lineage

DFA Known Limitations¶

Patterns that consistently challenge the DataFlowAnalyzer across validations:

Resolved with Configuration¶

Pattern	Affected Vendors	Resolution
`CASE WHEN` condition columns not traced	All	Enable `/showCaseWhenAsDirect`

Open Issues¶

Pattern	Affected Vendors	Root Cause
Oracle `DECODE()` first argument	Oracle, GaussDB	Treated as `fdr` in function.properties; `/showCaseWhenAsDirect` does not cover it
`GROUP_CONCAT` / `LISTAGG` / `STRING_AGG`	MySQL, PostgreSQL, SQL Server	Aggregate function arguments not traced to source columns
Trigger `NEW` pseudo-record resolution	MySQL	`NEW.col` resolves to target table (self) instead of source table
`JSON_TABLE` with `NESTED PATH`	Oracle	`NullPointerException` in column model creation
`CAST(MULTISET(SELECT ...))` in VALUES	Oracle	`IndexOutOfBoundsException` in INSERT analysis
PL/pgSQL `ELSIF`, `RECORD`, `FOR..IN EXECUTE`	Redshift	Parser does not support these PL/pgSQL constructs
`ARRAY_AGG(DISTINCT ... IGNORE NULLS)`	BigQuery	`NullPointerException` in aggregate analysis
BigQuery 2-arg `TRIM(expr, chars)`	BigQuery	Parser only supports standard `TRIM(LEADING/TRAILING FROM ...)`
BigQuery scripting (`FOR..IN..DO`, `BEGIN..EXCEPTION`)	BigQuery	Scripting constructs not supported
PostgreSQL `CREATE RULE` partition routing	PostgreSQL	Partially detected — column-level lineage incomplete
Function body table reads	PostgreSQL	SELECT INTO within functions not emitted in simple output mode

Expected Limitations (Not Bugs)¶

Pattern	Reason
Jinja/dbt templates (`{{ ref('...') }}`)	Not SQL — requires template pre-processing
Dynamic SQL (`EXECUTE IMMEDIATE`, string concatenation)	SQL text constructed at runtime, unknowable at parse time
External data loads (SQL*Loader, `COPY FROM`)	Data source is outside SQL; no SQL-level lineage

LLM Known Limitations¶

Patterns where LLM analysis is unreliable or incomplete:

Structural Limitations¶

Limitation	Impact	Example
No column-level precision	LLM identifies table-to-table flows but cannot reliably map individual columns through multi-table joins with aliases	`SELECT e.name, d.dept_name FROM emp e JOIN dept d ON ...` — LLM knows both tables contribute but may not map which column goes where
Cannot distinguish fdd vs fdr	All relationships reported as "lineage" without differentiating direct dataflow from indirect (WHERE/JOIN conditions)	`WHERE status = 'A'` — LLM reports `status` as lineage, but it's fdr (indirect), not fdd (direct)
Hallucination risk	May infer lineage that doesn't exist, especially with ambiguous column names or complex CTEs	A column named `id` in a CTE may be incorrectly attributed to the wrong source table
Non-deterministic	Different runs on the same SQL may produce slightly different results	One run may catch a subtle trigger flow, the next may miss it
Context window limits	Cannot process very large SQL files (>200K lines) or very many files simultaneously	The 231K-line INSERT data files must be skipped entirely

Analytical Limitations¶

Limitation	Impact
Misses FK/ER relationships	Does not systematically extract foreign key constraints from DDL as lineage
Over-counts INSERT...VALUES	May report literal INSERT statements as "lineage" when there's no source table
Inconsistent trigger analysis	May miss INSTEAD OF triggers or misidentify which tables are source vs target
No incremental analysis	Cannot efficiently re-analyze only changed files; must re-read everything
Cannot verify correctness	LLM lineage cannot be programmatically validated — it's an opinion, not a proof

When to Use Which Approach¶

Scenario	Recommended Approach
Production lineage for data governance	DFA — deterministic, column-level, auditable
Quick assessment of unfamiliar SQL codebase	LLM — handles any syntax, gives semantic context
Validating DFA accuracy on new SQL patterns	Both — use LLM as oracle, DFA as system under test
Lineage through dbt/Jinja templates	LLM — DFA cannot parse templates
Lineage for compliance/audit trails	DFA — reproducible, machine-readable output
Understanding why data flows a certain way	LLM — can explain business logic and intent
Processing 10,000+ SQL files	DFA — scales to any size with batch/parallel processing
Debugging a specific query's lineage	DFA with `/showCaseWhenAsDirect` and `/env metadata.json` — precise column-level detail

Configuration Impact¶

Some DFA options significantly affect lineage completeness:

Option	Impact
`/showCaseWhenAsDirect`	Resolves all CASE WHEN condition column misses
`/showConstant`	Makes literal value flows visible
`/env metadata.json`	Improves orphan column resolution via external schema metadata
`/lof`	Links unresolved columns to first table (fallback heuristic)
`/treatArgumentsInCountFunctionAsDirectDataflow`	Treats aggregate function arguments as direct dataflow

See Configuration Options for full details.