Oracle: DB Sample Schemas¶
| Metric | Count |
|---|---|
| Total lineage found by LLM | 28 |
| Total lineage found by DFA | 27 |
| DFA matches (correct) | 27 |
| DFA crashes (lineage missed) | 1 |
| DFA silent misses | 0 |
| DFA robustness issues | 3 |
| Accuracy | 96% |
Source: Oracle DB Sample Schemas Date: 2026-04-02
Lineage Inventory¶
Views (21 found by both LLM and DFA)¶
| # | View Name | Schema | Source Tables | Type | Pattern |
|---|---|---|---|---|---|
| 1 | emp_details_view | HR | employees, departments, jobs, locations, countries, regions | view | Implicit joins, WHERE conditions |
| 2 | customer_order_products | CO | orders, order_items, customers, products | view | Explicit JOINs, SUM, LISTAGG |
| 3 | store_orders | CO | stores, orders, order_items | view | JOINs, GROUPING SETS, GROUPING_ID |
| 4 | product_orders | CO | orders, order_items, customers, products | view | JOINs, GROUP BY, SUM, COUNT |
| 5 | products | OE | product_information, product_descriptions | view | Outer join (+), CASE, TRANSLATE USING NCHAR_CS |
| 6 | sydney_inventory | OE | inventories, warehouses, products (view) | view | 3-table join, WHERE filter |
| 7 | bombay_inventory | OE | inventories, warehouses, products (view) | view | 3-table join, WHERE filter |
| 8 | toronto_inventory | OE | inventories, warehouses, products (view) | view | 3-table join, WHERE filter |
| 9 | product_prices | OE | product_information | view | GROUP BY, COUNT, MIN, MAX |
| 10 | oc_inventories | OE | inventories, warehouses | object view | OF type, warehouse_typ() constructor |
| 11 | oc_product_information | OE | product_information, oc_inventories | object view | CAST(MULTISET(SELECT...)) |
| 12 | oc_customers | OE | customers, orders, order_items, oc_product_information | object view | Nested CAST(MULTISET), MAKE_REF, self-ref |
| 13 | oc_corporate_customers | OE | customers, orders, order_items, oc_customers, oc_product_information | object view | UNDER subtype, CAST(MULTISET), MAKE_REF |
| 14 | oc_orders | OE | orders, order_items, oc_customers, oc_product_information | object view | CAST(MULTISET), MAKE_REF |
| 15 | account_managers | OE | customers, countries | view | GROUP BY ROLLUP, object attribute access |
| 16 | customers_view | OE | customers, countries | view | Object dot-notation, UDF calls, SDO_GEOMETRY |
| 17 | orders_view | OE | orders | view | TO_DATE(TO_CHAR(...)) |
| 18 | DATABASE_SUMMARY | OE/XDB | v$system_parameter, v$database, v$instance, nls_database_parameters | view | System dynamic performance views |
| 19 | database_summary | OE/XDB | DATABASE_SUMMARY, V$VERSION, ALL_REGISTRY_BANNERS | XMLType view | xmlElement, xmlForest, xmlAgg |
Materialized Views (2 found by both)¶
| # | MV Name | Schema | Source Tables | Pattern |
|---|---|---|---|---|
| 20 | cal_month_sales_mv | SH | sh.sales, sh.times | GROUP BY, SUM, ENABLE QUERY REWRITE |
| 21 | fweek_pscat_sales_mv | SH | sh.sales, sh.times, sh.products | 3-table join, GROUP BY, SUM |
View (1 found by LLM only - DFA crash)¶
| # | View Name | Schema | Source Tables | Pattern | DFA Status |
|---|---|---|---|---|---|
| 22 | product_reviews | CO | products (JSON column) | JSON_TABLE with NESTED PATH, AVG OVER window | CRASH: NullPointerException |
Triggers with Data Flow (5 found by both)¶
| # | Trigger | Schema | Source | Target | Pattern |
|---|---|---|---|---|---|
| 23 | update_job_history | HR | employees (:old) | job_history (via add_job_history proc) | AFTER UPDATE trigger, procedure INSERT |
| 24 | insert_ord_line | OE | order_items (SELECT MAX) | order_items (:new) | BEFORE INSERT trigger, self-referencing |
| 25 | orders_trg | OE | oc_orders (view :NEW) | orders | INSTEAD OF INSERT |
| 26 | orders_items_trg | OE | oc_orders nested table (:NEW) | order_items | INSTEAD OF INSERT, DEREF |
| 27 | no_dml_operations_allowed | OE/XDB | database_summary (view) | (none - NULL body) | INSTEAD OF, no-op |
INSERT...SELECT (1 pattern, 22 statements - found by both)¶
| # | Source | Target | Pattern |
|---|---|---|---|
| 28 | oc_product_information, categories_tab | categories_tab | INSERT VALUES with CAST(MULTISET(SELECT...)) |
Critical Issues (DFA Crashes - Lineage Missed)¶
Issue 1: product_reviews view - JSON_TABLE with NESTED PATH¶
- Target:
product_reviews(view) - Expected lineage:
products->product_reviews(view) - Error:
NullPointerException: Cannot invoke "gudusoft.gsqlparser.nodes.TObjectName.toString()" because "column" is null - Location: DFA analyze sql stmt, line 223, column 1 of co_create.sql
- SQL pattern:
1 2 3 4 5 6 7 8 9 10 11 12 | |
The DFA crashes with NullPointerException when processing the JSON_TABLE function with COLUMNS clause. The column references inside JSON_TABLE COLUMNS clause produce a null TObjectName, causing the crash.
Robustness Issues¶
Issue R1: Sales History OOM on directory-level run¶
- Error:
OutOfMemoryError: Java heap spaceatDataFlowAnalyzer.analyzeAndOutputResult() - Workaround: Running individual files with
-Xmx4gsucceeds - Impact: No lineage missed - all SH views and MVs correctly detected per-file
Issue R2: Product Media JSON parsing errors¶
- Error:
RuntimeException: Unknown JSON value type.inTJSONSQLEnvParser.getJSONSQLEnv() - Impact: The PM schema has no data lineage (only DDL + LOB loading via SQL*Loader), so no lineage missed
Issue R3: xdbConfiguration ALTER View unsupported¶
- Error:
UnsupportedOperationException: Can't handle this alter view statement case.at line 206 - Impact: The ALTER VIEW statement is not a lineage source - the CREATE VIEW statements were correctly processed. No lineage missed.
Side-by-Side: Where Each Approach Excels¶
Summary¶
| Dimension | Winner | Evidence | Impact |
|---|---|---|---|
| Column-level precision | DFA | DFA traces exact column-to-column fdd relationships across all 21 views, including through object type constructors and CAST(MULTISET) | High |
| Multi-source expressions | DFA | DFA correctly traces computed columns like SUM(quantity * unit_price) to both source columns in store_orders and customer_order_products | Medium |
| FK/ER relationships | DFA | DFA extracts FK relationships from ALTER TABLE ADD CONSTRAINT across HR (6 FKs), CO (9 FKs), OE (9+ FKs) | Medium |
| fdd vs fdr distinction | DFA | DFA distinguishes direct dataflow (fdd: SELECT columns) from indirect (fdr: WHERE, JOIN ON, GROUP BY conditions) across all views | Medium |
| Novel/vendor-specific syntax | LLM | LLM correctly analyzed JSON_TABLE with NESTED PATH (product_reviews) where DFA crashed | High |
| Semantic understanding | LLM | LLM identified trigger -> procedure -> INSERT pipeline as a single logical data flow (update_job_history -> add_job_history -> job_history) | Medium |
| Graceful degradation | LLM | DFA crashed on 1 SQL pattern (JSON_TABLE), LLM handled all 28 lineage patterns | Medium |
| Aggregate/string functions | Tie | Both correctly identified LISTAGG, SUM, COUNT, AVG window functions as lineage sources | Low |
| Trigger/procedure bodies | Tie | Both traced trigger data flow through :OLD/:NEW pseudo-records and INSTEAD OF triggers with DEREF | Low |
Detail¶
DFA Advantages in This Codebase¶
-
Column-level precision: DFA traces exact column-to-column mappings. For example, in
emp_details_view, DFA mapsemployees.first_name->emp_details_view.first_nameas a direct fdd relationship, while also trackingemployees.department_id->departments.department_idas an indirect fdr (JOIN condition). The LLM identified the same columns but cannot formally distinguish fdd from fdr. -
FK/ER extraction: DFA extracted all foreign key relationships from DDL across all schemas (e.g.,
employees.department_id->departments.department_id,countries.region_id->regions.region_id). These are structural relationships that the LLM explicitly excluded from its analysis. -
Object view column tracing: DFA traced column lineage through Oracle object type constructors (warehouse_typ, inventory_typ) and CAST(MULTISET) patterns in the complex object-relational views (oc_inventories, oc_product_information, oc_customers, etc.).
-
INSERT...VALUES constant tracing: DFA traced all 88+ INSERT...VALUES statements in HR population data and 86+ in CO population data, mapping literal values to target columns.
LLM Advantages in This Codebase¶
-
JSON_TABLE handling: LLM correctly analyzed the product_reviews view using JSON_TABLE with COLUMNS clause, identifying that
products.product_details(JSON column) feedsratingandreviewvia JSON path expressions$.reviews[*].ratingand$.reviews[*].review. DFA crashed with NullPointerException on this pattern. -
Holistic trigger-procedure chains: LLM identified the complete data flow chain:
employeestable UPDATE ->update_job_historytrigger ->add_job_historyprocedure ->job_historytable INSERT, presenting this as a single logical lineage relationship. DFA found the individual components but required the user to assemble the chain. -
Business context: LLM noted that
customerstable is joined inproduct_ordersview but no customer columns appear in output (unnecessary join), and thatno_dml_operations_allowedis a no-op trigger. DFA processed these without noting the semantic observations.
Tie / Both Correct¶
- Standard views: Both approaches correctly identified all 18 standard/object views with their source tables.
- Materialized views: Both found
profits,cal_month_sales_mv, andfweek_pscat_sales_mvwith correct source tables. - INSTEAD OF triggers: Both correctly traced data flow through the
orders_trgandorders_items_trgINSTEAD OF triggers, including the DEREF pattern. - CAST(MULTISET) INSERT...SELECT: Both identified the 22 INSERT statements into
categories_tabusing CAST(MULTISET(SELECT...FROM oc_product_information)).
Recommendations¶
-
Fix JSON_TABLE support in DFA: The NullPointerException when processing JSON_TABLE COLUMNS clause is a parser/analyzer bug. The column references inside JSON_TABLE should be recognized as derived columns from the JSON source expression.
-
Add ALTER XMLType view support: The
UnsupportedOperationExceptionon ALTER VIEW for XMLType views should be handled gracefully (or supported if possible). -
Increase default heap for directory-level runs: The SH schema OOM at 2GB suggests the default should be 4GB for directories with many INSERT...VALUES statements.
-
Handle non-SQL files gracefully: The JSON parsing error in PM directory is caused by non-SQL files being processed. DFA should skip files that don't parse as SQL rather than throwing exceptions.