Class TCustomLexer

java.lang.Object
    gudusoft.gsqlparser.TCustomLexer
Direct Known Subclasses:
TLexerAccess, TLexerAnsi, TLexerathena, TLexerBigquery, TLexerClickhouse, TLexerCouchbase, TLexerDatabricks, TLexerDax, TLexerDb2, TLexerGaussDB, TLexerGreenplum, TLexerHana, TLexerHive, TLexerImpala, TLexerInformix, TLexerMdx, TLexerMssql, TLexerMysql, TLexerNetezza, TLexerOdbc, TLexerOpenedge, TLexerOracle, TLexerPostgresql, TLexerPresto, TLexerRedshift, TLexerSnowflake, TLexerSoql, TLexerSparksql, TLexerSybase, TLexerTeradata, TLexerVertica

public class TCustomLexer extends Object
Base lexer for all supported databases and the core tokenization engine for SQL parsing. The lexer reads SQL text character by character and produces tokens that represent the syntactic units of SQL. This process involves several key components and stages:

1. Input Management and Buffering

  • yyinput (BufferedReader): Primary input source for SQL text
  • yyline (char[]): Current line buffer read from input via readln()
  • buf (char[]): Reversed line buffer for character-by-character processing
  • bufptr: Current position in buf, decrements as characters are consumed
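
The reversed-buffer scheme can be illustrated with a minimal, self-contained sketch. The names readLine(), buf and bufptr below merely mirror the roles of readln(), buf and bufptr described above; this is an illustration, not the library's internal code.

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.io.StringReader;

  // Simplified illustration: each line is copied into buf[] in reverse order,
  // so the next character is always buf[bufptr] and consuming a character is
  // just a decrement of bufptr.
  class ReversedLineBuffer {
      private final BufferedReader input;   // plays the role of yyinput
      private char[] buf = new char[0];     // reversed line, like buf[]
      private int bufptr = -1;              // like bufptr; -1 means "line exhausted"

      ReversedLineBuffer(BufferedReader input) { this.input = input; }

      // Returns the next character, refilling from the next line when needed,
      // or -1 at end of input. A '\n' is appended to preserve line boundaries.
      int getChar() throws IOException {
          if (bufptr < 0) {
              String line = input.readLine();          // role of readln() -> yyline
              if (line == null) return -1;
              String withNewline = line + "\n";
              buf = new char[withNewline.length()];
              for (int i = 0; i < withNewline.length(); i++) {
                  buf[buf.length - 1 - i] = withNewline.charAt(i);  // reverse copy
              }
              bufptr = buf.length - 1;
          }
          return buf[bufptr--];                        // consume by decrementing
      }

      public static void main(String[] args) throws IOException {
          ReversedLineBuffer r = new ReversedLineBuffer(
                  new BufferedReader(new StringReader("SELECT 1\nFROM t")));
          int c;
          while ((c = r.getChar()) != -1) System.out.print((char) c);
      }
  }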

2. Token Text Formation Process

 SQL Input → readln() → yyline[] → reversed into buf[] → get_char() → yytextbuf[]
                                                                        ↓
                                                                yylex() processing
                                                                        ↓
                                                                 yylvalstr (String)
                                                                        ↓
                                                            TSourceToken.astext
 

Key Variables in Token Text Storage:

  • yytextbuf (char[]): Accumulator buffer for the token currently being formed
  • yytextlen: Current length of text in yytextbuf
  • yytextbufsize: Allocated size of yytextbuf (dynamically grows)
  • yylvalstr (String): Final token text string created from yytextbuf
  • literalbuf (StringBuilder): Special buffer for string literals and complex tokens
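
A simplified sketch of how such an accumulator might behave; the names mirror yytextbuf, yytextlen, yytextbufsize and yylvalstr, but the code is illustrative rather than the library's implementation.

  // Accumulates characters of the current token and produces its final String.
  class TokenTextBuffer {
      private char[] textBuf = new char[64]; // like yytextbuf; grows on demand
      private int textLen = 0;               // like yytextlen

      // Append one character, doubling the buffer when full
      // (the "dynamically grows" behavior of yytextbufsize).
      void append(char c) {
          if (textLen == textBuf.length) {
              textBuf = java.util.Arrays.copyOf(textBuf, textBuf.length * 2);
          }
          textBuf[textLen++] = c;
      }

      // Materialize the token text, as in: yylvalstr = new String(yytextbuf, 0, yytextlen)
      String takeText() {
          String s = new String(textBuf, 0, textLen);
          textLen = 0; // reset for the next token
          return s;
      }
  }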

3. Position Tracking System

The lexer maintains precise position information for every token:
  • yylineno: Current line number (1-based)
  • yycolno: Current column number (0-based)
  • offset: Absolute character offset from start of input
  • yylineno_p, yycolno_p, offset_p: Position values captured at the start of the current token
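
The bookkeeping can be pictured with the following illustrative tracker: markTokenStart() stands in for the snapshot into yylineno_p / yycolno_p / offset_p, and advance() for the per-character update. This is a sketch, not the actual lexer code.

  // Tracks the current position and the position at which the current token started.
  class PositionTracker {
      int lineNo = 1;   // like yylineno (1-based)
      int colNo = 0;    // like yycolno (0-based)
      int offset = 0;   // absolute character offset from the start of input

      int startLine, startCol, startOffset; // like yylineno_p / yycolno_p / offset_p

      // Call before the first character of a token is consumed.
      void markTokenStart() {
          startLine = lineNo;
          startCol = colNo;
          startOffset = offset;
      }

      // Advance the position for every consumed character.
      void advance(char c) {
          offset++;
          if (c == '\n') { lineNo++; colNo = 0; } else { colNo++; }
      }
  }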

4. Token Creation Workflow

  1. Characters are read via get_char() from buf[] into yytextbuf[]
  2. yylex() identifies token boundaries and type
  3. Token text is extracted: yylvalstr = new String(yytextbuf, 0, yytextlen)
  4. yylexwrap() creates TSourceToken with:
    • astext = yylvalstr (full token text copy)
    • lineNo = yylineno_p (start line)
    • columnNo = yycolno_p (start column)
    • offset = offset_p (absolute position)
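
From the caller's side, the result of this workflow is visible through the parser's token list. Below is a hedged usage sketch that assumes the usual TGSqlParser entry point and the public token fields named above (astext, lineNo, columnNo, offset); consult the class Javadoc for the exact types and accessors.

  import gudusoft.gsqlparser.EDbVendor;
  import gudusoft.gsqlparser.TGSqlParser;
  import gudusoft.gsqlparser.TSourceToken;

  public class TokenPositionDemo {
      public static void main(String[] args) {
          TGSqlParser parser = new TGSqlParser(EDbVendor.dbvoracle);
          parser.sqltext = "SELECT id\nFROM emp";
          if (parser.parse() == 0) {
              // Each TSourceToken carries the eagerly copied text plus the
              // start position captured from yylineno_p / yycolno_p / offset_p.
              for (int i = 0; i < parser.sourcetokenlist.size(); i++) {
                  TSourceToken token = parser.sourcetokenlist.get(i);
                  System.out.println(token.astext + "  line=" + token.lineNo
                          + "  col=" + token.columnNo + "  offset=" + token.offset);
              }
          } else {
              System.out.println("parse failed");
          }
      }
  }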

5. Memory Management and Text Copying

Current Implementation (Eager Loading):
  • Every token immediately copies its text from yytextbuf to TSourceToken.astext
  • Original SQL text in yyline is discarded after processing each line
  • No direct link maintained between token and original input position
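
A tiny sketch of why the eager copy isolates tokens from the lexer's buffers (stand-in variables, not library code): once astext is built with new String(...), reusing or overwriting yytextbuf and yyline for the next line cannot affect tokens that were already created.

  public class EagerCopyDemo {
      public static void main(String[] args) {
          char[] textBuf = "SELECT".toCharArray();                 // stand-in for yytextbuf
          String astext = new String(textBuf, 0, textBuf.length);  // eager copy into the token

          java.util.Arrays.fill(textBuf, 'x');                     // buffer reused for the next line
          System.out.println(astext);                              // still prints SELECT
      }
  }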

6. Tracing Back to Original Position

Currently Possible:
  • Token stores lineNo, columnNo, and offset
  • These values can be used to locate the token's position in the original input, provided the caller has retained that input
Current Limitations:
  • Original input text is not retained after line processing
  • yyline buffer is overwritten for each new line
  • No mechanism to retrieve original text from position alone
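
If the calling application keeps the original SQL string itself, the stored positions can still be mapped back to it. A minimal sketch, assuming offset is a 0-based character index into the original input and astext holds the token text, as described above:

  public class PositionLookup {
      // Recover the source slice that a token came from, given its offset and text length.
      static String sliceByOffset(String originalSql, int offset, int length) {
          return originalSql.substring(offset, Math.min(offset + length, originalSql.length()));
      }

      public static void main(String[] args) {
          String sql = "SELECT id\nFROM emp";
          // Suppose a token reported offset = 10 and astext = "FROM".
          System.out.println(sliceByOffset(sql, 10, "FROM".length())); // prints FROM
      }
  }
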
Author:
Gudu Software