Package gudusoft.gsqlparser
Class TCustomLexer
Object
gudusoft.gsqlparser.TCustomLexer
- Direct Known Subclasses:
TLexerAccess, TLexerAnsi, TLexerathena, TLexerBigquery, TLexerClickhouse, TLexerCouchbase, TLexerDatabricks, TLexerDax, TLexerDb2, TLexerGaussDB, TLexerGreenplum, TLexerHana, TLexerHive, TLexerImpala, TLexerInformix, TLexerMdx, TLexerMssql, TLexerMysql, TLexerNetezza, TLexerOdbc, TLexerOpenedge, TLexerOracle, TLexerPostgresql, TLexerPresto, TLexerRedshift, TLexerSnowflake, TLexerSoql, TLexerSparksql, TLexerSybase, TLexerTeradata, TLexerVertica
Base lexer of all databases - Core tokenization engine for SQL parsing.
The lexer reads SQL text character by character and produces tokens that represent
the syntactic units of SQL. This process involves several key components and stages:
1. Input Management and Buffering
- yyinput (BufferedReader): Primary input source for SQL text
- yyline (char[]): Current line buffer read from input via readln()
- buf (char[]): Reversed line buffer for character-by-character processing
- bufptr: Current position in buf, decrements as characters are consumed
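The reversed-buffer scheme above can be sketched as follows. This is a minimal illustration, not the library's actual implementation; the field names mirror buf and bufptr from the description, and loadLine() stands in for the readln()-and-reverse step:

```java
// Illustrative sketch of the reversed line buffer: the line is copied into
// buf[] back to front, so getChar() can pop the next character simply by
// decrementing bufptr; bufptr == 0 signals that the line is exhausted.
public class ReversedLineBuffer {
    char[] buf;   // reversed line buffer
    int bufptr;   // index just past the next character to consume

    void loadLine(String line) {
        buf = new char[line.length()];
        for (int i = 0; i < line.length(); i++) {
            buf[line.length() - 1 - i] = line.charAt(i); // reverse copy
        }
        bufptr = line.length();
    }

    int getChar() {
        if (bufptr == 0) return -1;  // line exhausted
        return buf[--bufptr];        // consume by decrementing bufptr
    }

    // Drains a whole line through the buffer; characters come back
    // in their original order because the buffer was stored reversed.
    public static String drain(String line) {
        ReversedLineBuffer b = new ReversedLineBuffer();
        b.loadLine(line);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = b.getChar()) != -1) sb.append((char) c);
        return sb.toString();
    }
}
```

Storing the line reversed lets end-of-line be detected with a single comparison against zero rather than against a stored line length.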
2. Token Text Formation Process
SQL Input → readln() → yyline[] → reversed into buf[] → get_char() → yytextbuf[]
    → yylex() processing
    → yylvalstr (String)
    → TSourceToken.astext
Key Variables in Token Text Storage:
- yytextbuf (char[]): Accumulator buffer for current token being formed
- yytextlen: Current length of text in yytextbuf
- yytextbufsize: Allocated size of yytextbuf (dynamically grows)
- yylvalstr (String): Final token text string created from yytextbuf
- literalbuf (StringBuilder): Special buffer for string literals and complex tokens
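A minimal sketch of the accumulator described above, assuming a doubling growth policy (TokenTextBuffer is illustrative; only the field names yytextbuf and yytextlen come from the description):

```java
// Sketch of the token-text accumulator: characters are appended into
// yytextbuf, which doubles in size when full (the role yytextbufsize
// plays in the lexer), and the finished token text is materialized as
// a String, as yylvalstr is.
public class TokenTextBuffer {
    char[] yytextbuf = new char[8]; // starts small, grows on demand
    int yytextlen = 0;              // current length of text in yytextbuf

    void append(char c) {
        if (yytextlen == yytextbuf.length) {
            // grow the buffer; the buffer is reused across tokens,
            // so growth cost amortizes over the whole input
            yytextbuf = java.util.Arrays.copyOf(yytextbuf, yytextbuf.length * 2);
        }
        yytextbuf[yytextlen++] = c;
    }

    String finish() {
        String s = new String(yytextbuf, 0, yytextlen); // yylvalstr
        yytextlen = 0; // reset length, keep the buffer for the next token
        return s;
    }

    public static String tokenize(String text) {
        TokenTextBuffer b = new TokenTextBuffer();
        for (char c : text.toCharArray()) b.append(c);
        return b.finish();
    }
}
```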
3. Position Tracking System
The lexer maintains precise position information for every token:
- yylineno: Current line number (1-based)
- yycolno: Current column number (0-based)
- offset: Absolute character offset from the start of input
- yylineno_p, yycolno_p, offset_p: Saved position values marking the start of the current token
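The tracking scheme can be sketched like this. PositionTracker and its methods are illustrative stand-ins; only the field names come from the description above:

```java
// Sketch of position tracking: the current position is advanced character
// by character, and snapshotted into the *_p fields at a token boundary so
// the token's start position survives while lexing continues.
public class PositionTracker {
    int yylineno = 1, yycolno = 0, offset = 0; // current position
    int yylineno_p, yycolno_p, offset_p;       // token-start snapshot

    void markTokenStart() {
        yylineno_p = yylineno;
        yycolno_p = yycolno;
        offset_p = offset;
    }

    void advance(char c) {
        offset++;                                   // absolute offset always moves
        if (c == '\n') { yylineno++; yycolno = 0; } // newline resets the column
        else yycolno++;
    }

    // Feed a prefix through the tracker, then snapshot: returns the
    // {line, column, offset} a token starting right after it would get.
    public static int[] positionAfter(String prefix) {
        PositionTracker t = new PositionTracker();
        for (char c : prefix.toCharArray()) t.advance(c);
        t.markTokenStart();
        return new int[] { t.yylineno_p, t.yycolno_p, t.offset_p };
    }
}
```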
4. Token Creation Workflow
- Characters are read via get_char() from buf[] into yytextbuf[]
- yylex() identifies token boundaries and type
- Token text is extracted: yylvalstr = new String(yytextbuf, 0, yytextlen)
- yylexwrap() creates TSourceToken with:
  - astext = yylvalstr (full token text copy)
  - lineNo = yylineno_p (start line)
  - columnNo = yycolno_p (start column)
  - offset = offset_p (absolute position)
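The handoff in the last step can be sketched as follows. SourceToken and wrap() are illustrative stand-ins for the library's TSourceToken and yylexwrap(), shown only to make the field assignments concrete:

```java
// Sketch of the yylexwrap() handoff: the finished token text and the saved
// start-position snapshot are copied into a token object.
public class LexWrapSketch {
    // Stand-in for the library's TSourceToken.
    static class SourceToken {
        String astext;
        int lineNo, columnNo;
        long offset;
    }

    // Fills a token the way step 4 above describes.
    static SourceToken wrap(String yylvalstr, int yylineno_p, int yycolno_p, long offset_p) {
        SourceToken t = new SourceToken();
        t.astext = yylvalstr;   // full copy of the token text
        t.lineNo = yylineno_p;  // start line of the token
        t.columnNo = yycolno_p; // start column of the token
        t.offset = offset_p;    // absolute offset of the token start
        return t;
    }
}
```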
5. Memory Management and Text Copying
Current implementation (eager loading):
- Every token immediately copies its text from yytextbuf to TSourceToken.astext
- The original SQL text in yyline is discarded after each line is processed
- No direct link is maintained between a token and its original input position
6. Tracing Back to Original Position
Currently possible:
- A token stores its lineNo, columnNo, and offset
- These can, in principle, locate the token's position in the original input
Limitations:
- The original input text is not retained after line processing
- The yyline buffer is overwritten for each new line
- There is no mechanism to retrieve the original text from a position alone
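Because the lexer itself discards yyline, recovering original text only works if the caller retains the complete input string and uses the token's recorded offset against it. A hypothetical helper (textAt is not part of the library):

```java
// Maps a token's recorded absolute offset back to the original text,
// assuming the caller kept the full SQL string around. The offset and
// length would come from a token produced by the lexer.
public class OriginalTextLookup {
    public static String textAt(String originalSql, int offset, int length) {
        return originalSql.substring(offset, offset + length);
    }
}
```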
Author:
Gudu Software
Field Summary
Constructor Summary
Constructors:
TCustomLexer()
Method Summary
- static EKeywordType getKeywordType(String keyword, HashMap<String, Integer> keywordValueList, HashMap<Integer, Integer> keywordTypeList): Deprecated; use keywordChecker.isKeyword() instead.
- int getkeywordvalue(String keyword)
- getSqlCharset()
- getStringByCode(int tokenCode)
- isAtBeginOfLine()
- iskeyword()
- protected boolean isKeyword(int tokenCode): check whether a token code represents a keyword
- protected boolean isSingleCharOperator(int tokenCode): check whether a token code represents a single-character operator
- void reset()
- void resetTokenTable(): reset TOKEN_TABLE by clearing only the entries that were used (incremental clear)
- void setSqlCharset(String sqlCharset)
- void setTokenTableValue(TSourceToken token)
- int yylexwrap(TSourceToken psourcetoken)
-
Field Details
- MAX_TOKEN_SIZE
- MAX_TOKEN_COLUMN_SIZE
- COLUMN0_COUNT
- COLUMN1_FIRST_X
- COLUMN2_FIRST_Y
- COLUMN3_LAST_X
- COLUMN4_LAST_Y
- COLUMN5_FIRST_POS
- COLUMN6_LAST_POS
- TOKEN_TABLE
- yyinput
- dolqstart
- insqlpluscmd
- keyword_type_reserved
- keyword_type_keyword
- keyword_type_identifier
- keyword_type_column
- delimiterchar
- defaultDelimiterStr
- tmpDelimiter
- bconst
- xconst
- UNICODE_ENCODE_ID
- insideSingleQuoteStr
- stringLiteralStartWithUnicodeSingleQuote
Constructor Details
- TCustomLexer
public TCustomLexer()
Method Details
- resetTokenTable
Reset TOKEN_TABLE by only clearing entries that were used (incremental clear). This is O(usedTokenCount) instead of O(MAX_TOKEN_SIZE * MAX_TOKEN_COLUMN_SIZE). For typical SQL with ~100 distinct token types, this saves clearing ~20,000 entries.
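A minimal sketch of this incremental-clear idea, assuming a set that remembers which rows were written (TokenTableSketch, record(), and usedTokens are illustrative, not the library's actual members; only TOKEN_TABLE's long[][] shape comes from the field list above):

```java
// Sketch of the incremental clear: instead of zeroing the entire
// MAX_TOKEN_SIZE x MAX_TOKEN_COLUMN_SIZE table, only the rows that were
// actually written since the last reset are cleared.
public class TokenTableSketch {
    static final int MAX_TOKEN_SIZE = 3000, MAX_TOKEN_COLUMN_SIZE = 7;
    final long[][] tokenTable = new long[MAX_TOKEN_SIZE][MAX_TOKEN_COLUMN_SIZE];
    final java.util.Set<Integer> usedTokens = new java.util.LinkedHashSet<>();

    void record(int tokenCode, int column, long value) {
        usedTokens.add(tokenCode); // remember which rows were touched
        tokenTable[tokenCode][column] = value;
    }

    // O(usedTokenCount) rather than O(MAX_TOKEN_SIZE * MAX_TOKEN_COLUMN_SIZE).
    void resetTokenTable() {
        for (int code : usedTokens) {
            java.util.Arrays.fill(tokenTable[code], 0L);
        }
        usedTokens.clear();
    }

    // Writes a couple of entries, resets, and sums the touched row;
    // a correct incremental clear leaves it all zeros.
    public static long sumAfterReset() {
        TokenTableSketch t = new TokenTableSketch();
        t.record(42, 0, 5);
        t.record(42, 1, 9);
        t.resetTokenTable();
        long sum = 0;
        for (long v : t.tokenTable[42]) sum += v;
        return sum;
    }
}
```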
- setTokenTableValue
- setSqlCharset
- getSqlCharset
- isSingleCharOperator
Check whether a token code represents a single-character operator.
- isKeyword
Check whether a token code represents a keyword.
- iskeyword
- isAtBeginOfLine
- getStringByCode
- getkeywordvalue
- getKeywordType
public static EKeywordType getKeywordType(String keyword, HashMap<String, Integer> keywordValueList, HashMap<Integer, Integer> keywordTypeList)
Deprecated. Use keywordChecker.isKeyword() instead: some databases have so many non-reserved keywords that it is not practical to encode them all in the lexer and parser.
Parameters: keyword, keywordValueList, keywordTypeList
Returns: the EKeywordType for the given keyword
- reset
- yylexwrap