General SQL Parser Architecture

Overview

The General SQL Parser (GSP) is a comprehensive Java-based SQL parsing library that supports 25+ database vendors. It follows a multi-layered architecture designed for extensibility, accuracy, and performance in parsing complex SQL statements across different database dialects.

High-Level Architecture

```mermaid
graph TB
    subgraph "Input Layer"
        A[SQL Text/File/Stream]
    end

    subgraph "Lexical Analysis Layer"
        B[Database-Specific Lexers]
        B1[TLexerOracle]
        B2[TLexerMssql]
        B3[TLexerPostgresql]
        B4[TLexer... 25+ databases]
    end

    subgraph "Syntax Analysis Layer"
        C[Database-Specific Parsers]
        C1[TParserOracle]
        C2[TParserMssql]
        C3[TParserPostgresql]
        C4[TParser... 25+ databases]
    end

    subgraph "AST Layer"
        D[Abstract Syntax Tree]
        D1[Statement Nodes]
        D2[Expression Nodes]
        D3[Table Reference Nodes]
        D4[Clause Nodes]
    end

    subgraph "Semantic Analysis Layer"
        E[Resolvers & Analyzers]
        E1[Name Resolution]
        E2[Type Resolution]
        E3[Scope Analysis]
        E4[Data Flow Analysis]
    end

    subgraph "Output Layer"
        F[Applications]
        F1[SQL Formatting]
        F2[Data Lineage]
        F3[SQL Translation]
        F4[Impact Analysis]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F

    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
```

Core Components

1. Main Parser Engine (TGSqlParser)

The TGSqlParser class serves as the primary entry point and orchestrator for the entire parsing process.

Key Responsibilities:

  • Database vendor detection and parser selection
  • Input source management (text, file, stream)
  • Tokenization coordination
  • Parse tree construction
  • Error handling and reporting

Usage Pattern:

```java
TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqltext("SELECT * FROM employees WHERE dept = 'IT'");
int result = parser.parse();
if (result == 0) {
    // Parse successful - AST available
    TStatementList statements = parser.getSqlstatements();
} else {
    // Parse failed - check errors
    String errors = parser.getErrormessage();
}
```

2. Database-Specific Lexers

Each database vendor has a dedicated lexer for tokenization that understands vendor-specific syntax.

Architecture Pattern:

```text
TCustomLexer (Base Class)
├── TLexerOracle
├── TLexerMssql
├── TLexerPostgresql
├── TLexerMysql
├── TLexerBigquery
└── ... (25+ database lexers)
```

Features:

  • Vendor-specific keyword recognition
  • Database-specific literal formats
  • Custom operator handling
  • Comment syntax variations

3. Database-Specific Parsers

Grammar-based parsers that understand the complete SQL syntax for each database vendor.

Parser Hierarchy:

```text
TCustomParser (Base Class)
├── TParserOracle (18,118 lines)
├── TParserMssql (20,336 lines)
├── TParserPostgresql (20,299 lines)
├── TParserTeradata (23,073 lines)
└── ... (25+ database parsers)
```

Capabilities:

  • Complete SQL grammar coverage
  • Vendor-specific extensions
  • Stored procedure parsing
  • DDL/DML/DCL statement support

4. Abstract Syntax Tree (AST) System

The AST system provides a unified representation of SQL statements across all database vendors.

Node Hierarchy

```mermaid
graph TD
    A[TParseTreeNode] --> B[TCustomSqlStatement]
    A --> C[TExpression]
    A --> D[TTable]
    A --> E[TObjectName]
    A --> F[TResultColumn]

    B --> B1[TSelectSqlStatement]
    B --> B2[TInsertSqlStatement]
    B --> B3[TUpdateSqlStatement]
    B --> B4[TDeleteSqlStatement]
    B --> B5[TCreateTableSqlStatement]

    C --> C1[Function Calls]
    C --> C2[Arithmetic Operations]
    C --> C3[Logical Conditions]
    C --> C4[Subqueries]

    D --> D1[Base Tables]
    D --> D2[Derived Tables]
    D --> D3[Table Functions]
    D --> D4[JOIN Expressions]

    style A fill:#ffeb3b
    style B fill:#4caf50
    style C fill:#2196f3
    style D fill:#ff9800
```

Key Node Types

| Node Type | Purpose | Key Classes |
|---|---|---|
| Statements | SQL statement representations | TSelectSqlStatement, TInsertSqlStatement, TUpdateSqlStatement |
| Expressions | SQL expressions and conditions | TExpression, TFunctionCall, TConstant |
| Tables | Table references and sources | TTable, TJoinExpr, TFromClause |
| Names | Identifiers and object names | TObjectName, TObjectNameList |
| Clauses | SQL clause components | TWhereClause, TOrderByClause, TGroupByClause |
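
To make the unified-hierarchy idea concrete, here is a toy sketch of a typed statement hierarchy in plain Java. Everything in it is hypothetical: the class names only loosely echo GSP's real nodes (TParseTreeNode, TCustomSqlStatement, TSelectSqlStatement, and so on) and are not part of its API.

```java
import java.util.List;

// Toy AST hierarchy; names loosely echo GSP's TParseTreeNode family
// but are NOT the real API.
public class AstSketch {
    public abstract static class Node {}                   // cf. TParseTreeNode
    public abstract static class Statement extends Node {} // cf. TCustomSqlStatement

    public static class SelectStatement extends Statement {
        public final List<String> columns;
        public final String table;
        public SelectStatement(List<String> columns, String table) {
            this.columns = columns;
            this.table = table;
        }
    }

    public static class InsertStatement extends Statement {
        public final String table;
        public InsertStatement(String table) { this.table = table; }
    }

    // A typed hierarchy lets analysis code branch safely on node kind.
    public static String describe(Statement s) {
        if (s instanceof SelectStatement) {
            return "SELECT from " + ((SelectStatement) s).table;
        }
        if (s instanceof InsertStatement) {
            return "INSERT into " + ((InsertStatement) s).table;
        }
        return "unknown statement";
    }

    public static void main(String[] args) {
        Statement s = new SelectStatement(List.of("id", "name"), "employees");
        System.out.println(describe(s)); // SELECT from employees
    }
}
```

The payoff of strong typing is that downstream tools can dispatch on statement or expression kind without re-parsing text.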

5. Semantic Analysis System

Post-parsing analysis components that provide advanced SQL understanding.

Resolver Components

```mermaid
graph LR
    A[Raw AST] --> B[Name Resolution]
    B --> C[Type Resolution]
    C --> D[Scope Analysis]
    D --> E[Data Flow Analysis]
    E --> F[Enriched AST]

    subgraph "Resolver Classes"
        G[TAttributeResolver]
        H[TRelationResolver]
        I[TSQLResolver]
        J[TMetadataCollector]
    end

    B -.-> G
    C -.-> H
    D -.-> I
    E -.-> J
```

Resolver Capabilities:

  • Name Resolution: Links column references to their source tables
  • Type Resolution: Determines data types of expressions
  • Scope Analysis: Handles nested scopes and visibility
  • Metadata Integration: Connects with external schema information
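
The name-resolution step can be illustrated with a deliberately tiny, hypothetical helper: given the tables visible in the current scope and their columns, it finds the owner of an unqualified column reference. GSP's resolver classes do the same job, but over the real AST and metadata.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical, heavily simplified name resolution: link an unqualified
// column reference to the single visible table that declares it.
public class NameResolutionSketch {
    // scope maps each visible table name to the set of its columns
    public static String resolve(String column, Map<String, Set<String>> scope) {
        String owner = null;
        for (Map.Entry<String, Set<String>> e : scope.entrySet()) {
            if (e.getValue().contains(column)) {
                if (owner != null) return "AMBIGUOUS"; // declared by two tables
                owner = e.getKey();
            }
        }
        return owner == null ? "UNRESOLVED" : owner;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> scope = Map.of(
            "employees", Set.of("id", "name", "dept_id"),
            "departments", Set.of("dept_id", "dept_name"));
        System.out.println(resolve("name", scope));    // employees
        System.out.println(resolve("dept_id", scope)); // AMBIGUOUS
    }
}
```

A real resolver must also handle aliases, qualified names, and nested query scopes, which is where scope analysis comes in.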

6. SQL Environment System (sqlenv)

Manages metadata and schema information for enhanced analysis.

Components:

  • TSQLEnv: Main environment container
  • TSQLTable: Table metadata representation
  • TSQLColumn: Column metadata representation
  • TSQLSchema: Schema/database organization

Integration Pattern:

```java
TSQLEnv sqlEnv = new TSQLEnv(EDbVendor.dbvoracle);
// Load metadata from database or DDL
sqlEnv.addTable("employees", columns);

TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqlEnv(sqlEnv);
// Parser now has access to schema information
```

Processing Flow

Phase 1: Lexical Analysis (Tokenization)

```mermaid
flowchart LR
    A[SQL Text] --> B[Character Stream]
    B --> C[Token Recognition]
    C --> D[Token Classification]
    D --> E[TSourceTokenList]

    subgraph "Token Types"
        F[Keywords]
        G[Identifiers]
        H[Literals]
        I[Operators]
        J[Whitespace]
        K[Comments]
    end

    D --> F
    D --> G
    D --> H
    D --> I
    D --> J
    D --> K
```

Key Features:

  • Database-specific keyword recognition
  • Unicode support for international characters
  • Position tracking for error reporting
  • Comment preservation for formatting
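
As a rough illustration of this phase (not GSP's lexer, which handles far more token types and vendor rules), a minimal keyword/identifier/literal classifier might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Minimal tokenizer sketch illustrating keyword/identifier/literal
// classification; purely illustrative, far simpler than a real SQL lexer.
public class LexerSketch {
    static final Set<String> KEYWORDS = Set.of("SELECT", "FROM", "WHERE");

    public static List<String> tokenize(String sql) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }    // skip whitespace
            if (Character.isLetter(c)) {                         // keyword or identifier
                int start = i;
                while (i < sql.length()
                        && (Character.isLetterOrDigit(sql.charAt(i)) || sql.charAt(i) == '_')) {
                    i++;
                }
                String word = sql.substring(start, i);
                tokens.add((KEYWORDS.contains(word.toUpperCase()) ? "KW:" : "ID:") + word);
            } else if (c == '\'') {                              // string literal
                int start = ++i;
                while (i < sql.length() && sql.charAt(i) != '\'') i++;
                tokens.add("LIT:" + sql.substring(start, i));
                i++;                                             // skip closing quote
            } else {                                             // operator/punctuation
                tokens.add("OP:" + c);
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("SELECT name FROM emp WHERE dept = 'IT'"));
    }
}
```

A production lexer would additionally track line/column positions for error reporting and preserve comments, as noted above.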

Phase 2: Syntax Analysis (Parsing)

```mermaid
flowchart TD
    A[Token Stream] --> B[Grammar Rules]
    B --> C[Recursive Descent Parsing]
    C --> D{Syntax Valid?}
    D -->|Yes| E[Build AST Nodes]
    D -->|No| F[Generate Syntax Error]
    E --> G[Complete AST]
    F --> H[Error Recovery]
    H --> I[Continue Parsing]
    I --> C
```

Parse Strategy:

  • Recursive Descent: Top-down parsing approach
  • Error Recovery: Continues parsing after syntax errors
  • Multi-Statement: Handles scripts with multiple statements
  • Stored Procedures: Supports complex procedural SQL
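
The recursive-descent shape (one method per grammar rule, each consuming tokens and calling the rules it references) can be shown on a toy boolean-condition grammar. This is illustrative only: it assumes well-formed input and has none of the error recovery a real SQL parser needs.

```java
// Tiny recursive-descent sketch for the toy grammar
//   cond   := term ('OR' term)*
//   term   := factor ('AND' factor)*
//   factor := 'true' | 'false'
// Illustrative only; assumes well-formed, whitespace-separated input.
public class DescentSketch {
    private final String[] toks;
    private int pos = 0;

    DescentSketch(String expr) { this.toks = expr.split("\\s+"); }

    // cond := term ('OR' term)*
    boolean cond() {
        boolean v = term();
        while (pos < toks.length && toks[pos].equals("OR")) { pos++; v = term() || v; }
        return v;
    }

    // term := factor ('AND' factor)*
    boolean term() {
        boolean v = factor();
        while (pos < toks.length && toks[pos].equals("AND")) { pos++; v = factor() && v; }
        return v;
    }

    // factor := 'true' | 'false'
    boolean factor() { return Boolean.parseBoolean(toks[pos++]); }

    public static boolean eval(String expr) { return new DescentSketch(expr).cond(); }

    public static void main(String[] args) {
        System.out.println(eval("true AND false OR true")); // true
    }
}
```

Scaling this pattern to full SQL, with thousands of rules per vendor dialect, is why the per-vendor parser classes run to tens of thousands of lines.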

Phase 3: Semantic Analysis

```mermaid
flowchart LR
    A[Raw AST] --> B[Name Resolution]
    B --> C[Type Checking]
    C --> D[Scope Validation]
    D --> E[Relationship Building]
    E --> F[Enriched AST]

    subgraph "Analysis Types"
        G[Table-Column Links]
        H[Function Validation]
        I[Data Type Inference]
        J[Dependency Tracking]
    end

    B --> G
    C --> H
    D --> I
    E --> J
```

Database Vendor Support

Multi-Vendor Architecture

The GSP architecture is designed for extensive database support through a plugin-like approach:

| Category | Databases | Key Features |
|---|---|---|
| Traditional RDBMS | Oracle, SQL Server, MySQL, PostgreSQL, DB2 | Complete SQL standard support |
| Cloud Data Warehouses | BigQuery, Redshift, Snowflake, Databricks | Modern SQL extensions |
| Big Data | Hive, Spark SQL, Impala, Presto | Distributed query support |
| Specialized | Teradata, Vertica, Netezza, SAP HANA | Vendor-specific optimizations |

Vendor-Specific Features

```java
// Example: Database-specific syntax handling
switch (parser.getDbVendor()) {
    case dbvoracle:
        // Oracle-specific: CONNECT BY, ROWNUM, etc.
        break;
    case dbvmssql:
        // SQL Server-specific: TOP, OUTPUT, etc.
        break;
    case dbvpostgresql:
        // PostgreSQL-specific: LATERAL, UNNEST, etc.
        break;
    case dbvbigquery:
        // BigQuery-specific: ARRAY, STRUCT, etc.
        break;
}
```

Data Lineage and Flow Analysis

DataFlowAnalyzer Architecture

```mermaid
graph TB
    subgraph "Input Analysis"
        A[SQL Statements]
        B[Metadata Schema]
        C[Configuration]
    end

    subgraph "Processing Engine"
        D[Statement Analysis]
        E[Column Mapping]
        F[Transformation Tracking]
        G[Dependency Resolution]
    end

    subgraph "Output Generation"
        H[Column Lineage]
        I[Table Dependencies]
        J[Data Flow Graph]
        K[Impact Analysis]
    end

    A --> D
    B --> E
    C --> F

    D --> H
    E --> I
    F --> J
    G --> K

    style D fill:#e3f2fd
    style H fill:#f1f8e9
```

Capabilities:

  • Column-Level Lineage: Traces data flow to individual columns
  • Cross-Database: Handles ETL across different database vendors
  • Complex Transformations: Supports joins, aggregations, window functions
  • View Resolution: Expands views to underlying table references
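
Column-level lineage ultimately reduces to chasing direct column-to-source mappings (for example, a view column to the columns in its defining query) back to base-table columns. A hypothetical sketch of that traversal follows; GSP's DataFlowAnalyzer derives such mappings from the AST, whereas here they are hard-coded and assumed acyclic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of transitive column lineage: chase direct mappings back to
// base-table columns. Assumes the mapping graph is acyclic.
public class LineageSketch {
    public static Set<String> baseSources(String column, Map<String, List<String>> direct) {
        Set<String> result = new TreeSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(column);
        while (!stack.isEmpty()) {
            String c = stack.pop();
            List<String> srcs = direct.get(c);
            if (srcs == null) result.add(c);   // no mapping: a base column
            else srcs.forEach(stack::push);    // chase each source column
        }
        return result;
    }

    public static void main(String[] args) {
        // v2.total <- v1.amount, and v1.amount <- sales.amount and fx.rate
        Map<String, List<String>> direct = Map.of(
            "v2.total", List.of("v1.amount"),
            "v1.amount", List.of("sales.amount", "fx.rate"));
        System.out.println(baseSources("v2.total", direct)); // [fx.rate, sales.amount]
    }
}
```

View resolution fits the same scheme: expanding a view simply adds its column mappings to the graph before the traversal.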

Visitor Pattern Implementation

The GSP uses the Visitor pattern for AST traversal and analysis:

```java
public class CustomAnalysisVisitor extends TParseTreeVisitor {
    @Override
    public void preVisit(TSelectSqlStatement stmt) {
        // Called before visiting SELECT statement children
        analyzeSelectStatement(stmt);
    }

    @Override
    public void preVisit(TTable table) {
        // Called for each table reference
        recordTableUsage(table);
    }

    @Override
    public void preVisit(TExpression expr) {
        // Called for each expression
        analyzeExpression(expr);
    }
}

// Usage
CustomAnalysisVisitor visitor = new CustomAnalysisVisitor();
statement.acceptChildren(visitor);
```

Advantages

1. Comprehensive Database Support

  • 25+ Database Vendors: Among the broadest coverage available in SQL parsing libraries
  • Accurate Parsing: Vendor-specific grammars ensure high accuracy
  • Regular Updates: Keeps pace with database evolution

2. Rich AST Representation

  • Detailed Structure: Preserves all SQL syntax elements
  • Type Safety: Strong typing for all node types
  • Extensible: Easy to add new node types

3. Advanced Analysis Capabilities

  • Semantic Analysis: Goes beyond syntax to understand meaning
  • Data Lineage: Column-level tracking across complex transformations
  • Impact Analysis: Determines effects of schema changes

4. Performance Optimization

  • Efficient Parsing: Optimized recursive descent parsing
  • Memory Management: Careful object lifecycle management
  • Parallel Processing: Support for concurrent analysis

5. Enterprise Features

  • Error Recovery: Continues parsing after syntax errors
  • Unicode Support: International character handling
  • API Stability: Backward compatibility maintenance

Disadvantages

1. Complexity

  • Large Codebase: 25+ database parsers create significant complexity
  • Learning Curve: Extensive API requires time to master
  • Memory Usage: Large AST structures can consume significant memory

2. Maintenance Overhead

  • Multiple Grammars: Each database requires separate grammar maintenance
  • Vendor Updates: Must track changes across all supported databases
  • Testing Complexity: Requires comprehensive test coverage for all vendors

3. Performance Considerations

  • Parse Time: Complex SQL can take significant time to parse
  • Memory Footprint: Large SQL scripts generate substantial AST structures
  • Cold Start: Initial parser creation has setup overhead

4. Licensing and Cost

  • Commercial License: Not free for commercial use
  • Feature Restrictions: Some advanced features require higher license tiers
  • Support Dependency: Complex issues may require vendor support

Use Case Patterns

1. SQL Analysis and Validation

```java
// Parse and validate SQL syntax
TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqltext(sqlScript);
if (parser.parse() == 0) {
    // SQL is syntactically correct
    analyzeSqlComplexity(parser.getSqlstatements());
} else {
    // Report syntax errors
    reportSyntaxErrors(parser.getSyntaxErrors());
}
```

2. Data Lineage Analysis

```java
// Trace data lineage
DataFlowAnalyzer analyzer = new DataFlowAnalyzer(
    sqlScript,
    EDbVendor.dbvoracle,
    false
);
analyzer.generateDataFlow();
for (TColumn column : analyzer.getColumns()) {
    System.out.println("Column: " + column.getDisplayName());
    System.out.println("Sources: " + column.getSourceColumns());
}
```

3. SQL Transformation

```java
// Transform SQL between databases
TGSqlParser sourceParser = new TGSqlParser(EDbVendor.dbvoracle);
sourceParser.setSqltext(oracleSql);
sourceParser.parse();

// Modify AST for target database
SQLTransformer transformer = new SQLTransformer(
    EDbVendor.dbvoracle,
    EDbVendor.dbvpostgresql
);
String postgresqlSql = transformer.transform(sourceParser.getSqlstatements());
```

4. Schema Impact Analysis

```java
// Analyze impact of schema changes
SchemaImpactAnalyzer analyzer = new SchemaImpactAnalyzer();
analyzer.addSqlScripts(applicationSqlFiles);
analyzer.addSchemaChange("ALTER TABLE employees DROP COLUMN old_field");

ImpactReport report = analyzer.analyze();
for (ImpactItem item : report.getImpactedQueries()) {
    System.out.println("Affected query: " + item.getQuery());
    System.out.println("Impact: " + item.getImpactDescription());
}
```

Best Practices

1. Parser Configuration

```java
// Configure parser for optimal performance
TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqlCharset("UTF-8");  // Specify charset for proper Unicode handling

// For large files, use streaming
parser.setSqlInputStream(new FileInputStream(largeSqlFile));

// Enable specific parsing features as needed
if (needsDataLineage) {
    parser.enableDataLineageAnalysis();
}
```

2. Memory Management

```java
// Process large SQL scripts efficiently
TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqltext(largeSqlScript);

if (parser.parse() == 0) {
    // Process statements one by one to minimize peak memory usage
    TStatementList statements = parser.getSqlstatements();
    for (int i = 0; i < statements.size(); i++) {
        TCustomSqlStatement stmt = statements.get(i);
        processStatement(stmt);
        // Clear the processed parse tree so it can be garbage-collected
        stmt.clearParseTree();
    }
}
```

3. Error Handling

```java
// Comprehensive error handling
TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
parser.setSqltext(sqlScript);

int parseResult = parser.parse();
if (parseResult != 0) {
    // Handle syntax errors
    for (TSyntaxError error : parser.getSyntaxErrors()) {
        System.err.printf("Line %d, Column %d: %s%n",
            error.getLineNo(),
            error.getColumnNo(),
            error.getErrorMessage()
        );
    }
} else {
    // Validate semantic correctness
    SemanticValidator validator = new SemanticValidator(parser);
    if (!validator.validate()) {
        // Handle semantic errors
        for (SemanticError error : validator.getErrors()) {
            System.err.println("Semantic error: " + error.getMessage());
        }
    }
}
```

Conclusion

The General SQL Parser represents a sophisticated and comprehensive solution for SQL analysis across multiple database vendors. Its multi-layered architecture provides both flexibility and accuracy, making it suitable for enterprise applications requiring deep SQL understanding.

While the complexity and licensing costs may be barriers for some use cases, the breadth of database support and advanced analytical capabilities make it a powerful tool for applications involving SQL migration, data lineage analysis, impact assessment, and cross-database SQL development.

The architecture's emphasis on extensibility and accuracy positions it well for handling the evolving landscape of SQL databases and the increasing complexity of modern data processing workflows.