llex.lua
) is a version of the native 5.1.x lexer from
Yueliang 0.4.0, with significant modifications. It does have several
limitations:
[[
...]]
long strings. (No LUA_COMPAT_LSTR
.)
llex.lua
processes an
entire string (all data from an entire file) and returns. Two lists
(tokens and semantic information items) are set up in the module for use
by the caller.
For maximum flexibility during processing, the lexer returns non-grammar lexical elements as tokens too. Non-grammar elements, such as comments, whitespace, line endings, are classified along with 'normal' tokens. The lexer classifies 7 kinds of grammar tokens and 4 kinds of non-grammar tokens, as follows:
Grammar Token |
Description |
TK_KEYWORD |
keywords |
TK_NAME |
identifiers |
TK_NUMBER |
numbers (unconverted, kept in original form) |
TK_STRING |
strings (no translation is done, includes delimiters) |
TK_LSTRING |
long strings (no translation is done, includes delimiters) |
TK_OP |
operators and punctuation (most single-char, some double) |
TK_EOS |
end-of-stream (there is only one for each file/stream) |
Whitespace Token |
Description |
TK_SPACE |
whitespace (generally, spaces, \t, \v and \f) |
TK_COMMENT |
comments (includes delimiters, also includes special first line shbang, which is handled specially in the optimizer) |
TK_LCOMMENT |
block comments (includes delimiters) |
TK_EOL |
end-of-lines (excludes those embedded in strings) |
A list of tokens can be generated by using the --dump-lexer option, like this:
lua LuaSrcDiet.lua --dump-lexer llex.lua > dump_llex.dat
TK_EOS
). Other tokens are whitespace-like. Comments
can be considered to be a special kind of whitespace, e.g. a short
comment needs to have a following EOL token, if we do not want to
optimize away short comments.
|
Second Token |
|
|
|
|
|
First Token |
Keyword |
Name |
Number |
String |
LString |
Oper |
Keyword |
[S] |
[S] |
[S] |
- |
- |
- |
Name |
[S] |
[S] |
[S] |
- |
- |
- |
Number |
[S] |
[S] |
[S] |
- |
- |
[1] |
String |
- |
- |
- |
- |
- |
- |
LString |
- |
- |
- |
- |
- |
- |
Oper |
- |
- |
[1] |
- |
- |
[2] |
A dash ('-') in the above means that the first token can abut the second token.
[S]
Need at least one whitespace, set as either a space or kept as an EOL.
[1]
Need a space if operator is a '.', all others okay. A '+' or '-' is used as part of a floating-point spec, but there does not appear to be any way of creating a float by joining with number with a a '+' or '-' plus another number. Since an 'e' has to be somewhere in the first token, this can't be done.
[2]
Normally there cannot be consecutive operators, but we plan to allow for generalized optimization of token sequences, i.e. even sequences that are grammatically illegal; so disallow adjacent operators if:
The following discusses the problem of local variable optimization, specifically local variable renaming in order to reduce source code size.
TK_NAME
Token ConsiderationsTK_NAME
token means a number of things, and some of these cannot
be renamed without analyzing the source code. We are interested in the
use of TK_NAME
in the following:
local
statements, local
functions, function parameters, implicit "self
" locals
TK_NAME
is also used in parts of the grammar as constant strings --
these tokens cannot be optimized without user assistance. These include
usage as:
key=value
pairs in table construction
a:b
or a.b
syntax forms
local string, table = string, table
In the example, the two locals are assigned the values of the globals with the same names. When Lua encounters the declaration portion:
local string, table
the parser cannot immediately make the two local variable available to
following code. In the parser and code generator, locals are inactive
when entries are created. They are activated only when the function
adjustlocalvars()
is called to activate the appropriate local
variables. (Note that the terminology used here may not be identical to
the ones used in the Dragon Book -- they merely follow the LuaSrcDiet
code as it was written before I have read the Dragon Book.)
In the example, the two local variables are activated only after the
whole statement has been parsed, that is, after the last "table
"
token. Hence, the statement works as expected. Also, once the two local
variables goes out of scope, removevars()
is called to deactivate
them, allowing other variables of the same name to become visible again.
Another example worth mentioning is:
local a, a, a, = 1, 2, 3
The above will assign 3 to 'a'.
Thus, when optimizing local variable names, (1) we need to consider accesses of global variable names affecting the namespace, (2) for the local variable names themselves, we need to consider when they are declared, activated and removed, and (3) within the 'live' time of locals, we need to know when they are accessed (since locals that are never accessed don't really matter.)
From the parser, we have the original name of the local variable, the
token positions for declaration, activation and removal, and the token
position for all the TK_NAME
tokens which references this local. All
instances of the implicit "self
" local variable are also flagged as
such.
In addition to local variable information, all global variable accesses
are tabled, one object entry for one name, and each object has a
corresponding list of token positions for the TK_NAME
tokens, which
is where the global variables were accessed.
The key criteria is: Our act of renaming cannot change the visibility of any of these locals and globals at the time they are accessed. However, their scope of visibility may be changed during which they are not accessed, so someone who tries to insert a variable reference somewhere into a program that has its locals renamed may find that it now refers to a different variable.
Of course, if every variable has a unique name, then there is no need for a name allocation algorithm, as there will be no conflict. But, in order to maximize utilization of short identifier names to reduce the final code size, we want to reuse the names as much as possible. In addition, fewer names will likely reduce symbol entropy and may slightly improve compressibility of the source code. LuaSrcDiet avoids the use of non-ASCII letters, so there are only 53 single-character variable names.
Consider three objects, local object A, local object B and global object G. A and B involve declaration, activation and removal, and within the period it is active, there may be zero or more accesses of the local. For G, there are only global variable accesses to look into.
Assume that we have assigned a new name to A and we wish to consider its effects on other locals and globals, for which we choose B and G as examples. We assume local B has not been assigned a new name as we expect our algorithm to take care of collisions.
A's lifetime is something like this:
Decl Act Rem + +-------------------------------+ -------------------------------------------------
where 'Decl' is the time of declaration, 'Act' is the time of
activation, and 'Rem' is the time of removal. Between 'Act' and 'Rem',
the local is alive or 'live' and Lua can see it if its corresponding
TK_NAME
identifier comes up.
Decl Act Rem + +-------------------------------+ ------------------------------------------------- * * * * (1) (2) (3) (4)
Recall that the key criteria is to not change the visibility of globals and locals during when they are accessed. Consider local and global accesses at (1), (2), (3) and (4).
A global G of the same name as A will only collide at (3), where Lua will see A and not G. Since G must be accessed at (3) according to what the parser says, and we cannot modify the positions of 'Decl', 'Act' and 'Rem', it follows that A cannot have the same name as G.
Decl Act Rem + +-----------------------+ --------------------------------- (1)+ +---+ (2)+ +---+ (3)+ +---+ (4)+ +---+ --------- --------- --------- ---------
For the case of A and B having the same names and colliding, consider the cases for which B is at (1), (2), (3) or (4) in the above.
(1) and (4) means that A and B are completely isolated from each other, hence in the two cases, A and B can safely use the same variable names. To be specific, since we have assigned A, B is considered completely isolated from A if B's activation-to-removal period is isolated from the time of A's first access to last access, meaning B's active time will never affect any of A's accesses.
For (2) and (3), we have two cases where we need to consider which one has been activated first. For (2), B is active before A, so A cannot impose on B. But A's accesses are valid while B is active, since A can override B. For no collision in the case of (2), we simply need to ensure that the last access of B occurs before A is activated.
For (3), B is activated before A, hence B can override A's accesses. For no collision, all of A's accesses cannot happen while B is active. Thus position (3) follows the "A is never accessed when B is active" rule in a general way. Local variables of a child function are in the position of (3). To illustrate, the local B can use the same name as local A and live in a child function or block scope if each time A is accessed, Lua sees A and not B. So we have to check all accesses of A and see whether they collide with the active period of B. If A is not accessed during that period, then B can be active with the same name.
The above appears to resolve all sorts of cases where the active times of A and B overlap. Note that in the above, the allocator does not need to know how locals are separated according to function prototypes. Perhaps the allocator can be simplified if knowledge of function structure is utilized. This scheme was implemented in a hurry in 2008 -- it could probably be simpler if Lua grammar is considered, but LuaSrcDiet mainly processes various index values in tables.
Since there are 53 one-character identifiers and (53*63-4) two-character identifiers (minus a few keywords), there isn't a pressing need to optimally maximize name reuse. The single-file version of LuaSrcDiet 0.12.0, at just over 3000 SLOC and 156KB in size, currently allocates around 55 unique local variable names.
In theory, we should need no more than 260 local identifiers by default.
Why? Since LUAI_MAXVARS
is 200 and LUAI_MAXUPVALUES
is 60, at
any block scope, there can be at most (LUAI_MAXVARS +
LUAI_MAXUPVALUES)
locals referenced, or 260. Also, those from outer
scopes not referenced in inner scopes can reuse identifiers. The net
effect of this is that a local variable name allocation method should
not allocate more than 260 identifier names for locals.
The current algorithm is a simple first-come first-served scheme:
self
" locals that have been flagged as such are already "assigned to" and so they are left unmodified.
self
" to avoid conflicts. This is not optimal but it is unlikely a script will use so many local variables as to reach "self
".
The algorithm has additional code for handling locals that use the same name in the same scope. This extends the basic algorithm that was discussed earlier. For example:
local foo = 10 -- (1) ... local foo = 20 -- (2) ... print(e)
Since we are considering name visibility, the first 'foo
' does not
really cease to exist when the second 'foo
' is declared, because if
we were to make that assumption, and the first 'foo
' is removed
before (2), then I should be able to use 'e
' as the name for the
first 'foo
' and after (2), it should not conflict with variables in
the outer scope with the same name. To illustrate:
local e = 10 -- 'foo ' renamed to 'e' ... local t = 20 -- error if we assumed 'e' removed here ... print(e)
Since 'e
' is a global in the example, we now have an error as the
name as been taken over by a local. Thus, the first 'foo
' local must
have its active time extend to the end of the current scope. If there is
no conflict between the first and second 'foo
', the algorithm may
still assign the same names to them.
The current fix to deal with the above chains local objects in order to find the removal position. It may be possible to handle this in a clean manner -- LuaSrcDiet handles it as a fix to the basic algorithm.
local
to zap a bunch of globals.
local
keywords.
local
s that can be removed in the source.
local
s in multiple assignments.
local
s that can be merged.
-- separate locals local foo local bar -- separate locals with assignments local foo = 123 local bar = "pqr"
-- merged locals local foo,bar -- merged locals with assignments local foo,bar=123,"pqr"
nil
.
local foo, bar = nil, nil
local foo,bar
return
s using nil
.
nil
.