TechNotes



Lexer Notes

The lexer (llex.lua) is a version of the native 5.1.x lexer from Yueliang 0.4.0, with significant modifications. It does have several limitations:

Instead of returning one token on each call, llex.lua processes an entire string (all data from an entire file) and returns. Two lists (tokens and semantic information items) are set up in the module for use by the caller.

For maximum flexibility during processing, the lexer returns non-grammar lexical elements as tokens too. Non-grammar elements, such as comments, whitespace, line endings, are classified along with 'normal' tokens. The lexer classifies 7 kinds of grammar tokens and 4 kinds of non-grammar tokens, as follows:

Grammar Token
Description
TK_KEYWORD
keywords
TK_NAME
identifiers
TK_NUMBER
numbers (unconverted, kept in original form)
TK_STRING
strings (no translation is done, includes delimiters)
TK_LSTRING
long strings (no translation is done, includes delimiters)
TK_OP
operators and punctuation (most single-char, some double)
TK_EOS
end-of-stream (there is only one for each file/stream)

Whitespace Token
Description
TK_SPACE
whitespace (generally, spaces, \t, \v and \f)
TK_COMMENT
comments (includes delimiters, also includes special first line shbang, which is handled specially in the optimizer)
TK_LCOMMENT
block comments (includes delimiters)
TK_EOL
end-of-lines (excludes those embedded in strings)

A list of tokens can be generated by using the --dump-lexer option, like this:

lua LuaSrcDiet.lua --dump-lexer llex.lua > dump_llex.dat


Lexer Optimizations

We aim to keep lexer-based optimizations free of parser considerations, i.e. we allow for generalized optimization of token sequences. The table below considers the requirements for all combinations of significant tokens (except TK_EOS). Other tokens are whitespace-like. Comments can be considered to be a special kind of whitespace, e.g. a short comment needs to have a following EOL token, if we do not want to optimize away short comments.


Second Token





First Token
Keyword
Name
Number
String
LString
Oper
Keyword
[S]
[S]
[S]
-
-
-
Name
[S]
[S]
[S]
-
-
-
Number
[S]
[S]
[S]
-
-
[1]
String
-
-
-
-
-
-
LString
-
-
-
-
-
-
Oper
-
-
[1]
-
-
[2]

A dash ('-') in the above means that the first token can abut the second token.

[S] Need at least one whitespace, set as either a space or kept as an EOL.
[1] Need a space if operator is a '.', all others okay. A '+' or '-' is used as part of a floating-point spec, but there does not appear to be any way of creating a float by joining with number with a a '+' or '-' plus another number. Since an 'e' has to be somewhere in the first token, this can't be done.
[2] Normally there cannot be consecutive operators, but we plan to allow for generalized optimization of token sequences, i.e. even sequences that are grammatically illegal; so disallow adjacent operators if:
Also, a minus '-' cannot preceed a Comment or LComment, because comments start with a '--' prefix. Apart from that, all Comment or LComment tokens can be set abut with a real token.


Local Variable Renaming

The following discusses the problem of local variable optimization, specifically local variable renaming in order to reduce source code size.

TK_NAME Token Considerations

A TK_NAME token means a number of things, and some of these cannot be renamed without analyzing the source code. We are interested in the use of TK_NAME in the following:

(a) global variable access
(b) local variable declaration, including local statements, local functions, function parameters, implicit "self" locals
(c) local variable access, including upvalue access
TK_NAME is also used in parts of the grammar as constant strings -- these tokens cannot be optimized without user assistance. These include usage as:

(d) keys in key=value pairs in table construction
(e) field or method names in a:b or a.b syntax forms
For the local variable name optimization scheme used, we do not consider (d) and (e), and while global variables cannot be renamed without some kind of user assistance, they need to be considered or tracked as part of Lua's variable access scheme.

Lifetime of a Local Variable

Consider the following example:

local string, table = string, table

In the example, the two locals are assigned the values of the globals with the same names. When Lua encounters the declaration portion:

local string, table

the parser cannot immediately make the two local variable available to following code. In the parser and code generator, locals are inactive when entries are created. They are activated only when the function adjustlocalvars() is called to activate the appropriate local variables. (Note that the terminology used here may not be identical to the ones used in the Dragon Book -- they merely follow the LuaSrcDiet code as it was written before I have read the Dragon Book.)

In the example, the two local variables are activated only after the whole statement has been parsed, that is, after the last "table" token. Hence, the statement works as expected. Also, once the two local variables goes out of scope, removevars() is called to deactivate them, allowing other variables of the same name to become visible again.

Another example worth mentioning is:

local a, a, a, = 1, 2, 3

The above will assign 3 to 'a'.

Thus, when optimizing local variable names, (1) we need to consider accesses of global variable names affecting the namespace, (2) for the local variable names themselves, we need to consider when they are declared, activated and removed, and (3) within the 'live' time of locals, we need to know when they are accessed (since locals that are never accessed don't really matter.)

Local Variable Tracking

Every local variable declaration is considered an object to be renamed.

From the parser, we have the original name of the local variable, the token positions for declaration, activation and removal, and the token position for all the TK_NAME tokens which references this local. All instances of the implicit "self" local variable are also flagged as such.

In addition to local variable information, all global variable accesses are tabled, one object entry for one name, and each object has a corresponding list of token positions for the TK_NAME tokens, which is where the global variables were accessed.

The key criteria is: Our act of renaming cannot change the visibility of any of these locals and globals at the time they are accessed. However, their scope of visibility may be changed during which they are not accessed, so someone who tries to insert a variable reference somewhere into a program that has its locals renamed may find that it now refers to a different variable.

Of course, if every variable has a unique name, then there is no need for a name allocation algorithm, as there will be no conflict. But, in order to maximize utilization of short identifier names to reduce the final code size, we want to reuse the names as much as possible. In addition, fewer names will likely reduce symbol entropy and may slightly improve compressibility of the source code. LuaSrcDiet avoids the use of non-ASCII letters, so there are only 53 single-character variable names.

Name Allocation Theory

To understand the renaming algorithm, first we need to establish how different local and global variables can operate happily without interfering with each other.

Consider three objects, local object A, local object B and global object G. A and B involve declaration, activation and removal, and within the period it is active, there may be zero or more accesses of the local. For G, there are only global variable accesses to look into.

Assume that we have assigned a new name to A and we wish to consider its effects on other locals and globals, for which we choose B and G as examples. We assume local B has not been assigned a new name as we expect our algorithm to take care of collisions.

A's lifetime is something like this:

        Decl            Act                             Rem
        +               +-------------------------------+
        -------------------------------------------------

where 'Decl' is the time of declaration, 'Act' is the time of activation, and 'Rem' is the time of removal. Between 'Act' and 'Rem', the local is alive or 'live' and Lua can see it if its corresponding TK_NAME identifier comes up.

        Decl            Act                             Rem
        +               +-------------------------------+
        -------------------------------------------------
   *            *             *                              *
  (1)          (2)           (3)                            (4)

Recall that the key criteria is to not change the visibility of globals and locals during when they are accessed. Consider local and global accesses at (1), (2), (3) and (4).

A global G of the same name as A will only collide at (3), where Lua will see A and not G. Since G must be accessed at (3) according to what the parser says, and we cannot modify the positions of 'Decl', 'Act' and 'Rem', it follows that A cannot have the same name as G.

                Decl    Act                     Rem
                +       +-----------------------+
                ---------------------------------
 (1)+   +---+   (2)+   +---+      (3)+   +---+	    (4)+   +---+
    ---------      ---------         ---------         ---------

For the case of A and B having the same names and colliding, consider the cases for which B is at (1), (2), (3) or (4) in the above.

(1) and (4) means that A and B are completely isolated from each other, hence in the two cases, A and B can safely use the same variable names. To be specific, since we have assigned A, B is considered completely isolated from A if B's activation-to-removal period is isolated from the time of A's first access to last access, meaning B's active time will never affect any of A's accesses.

For (2) and (3), we have two cases where we need to consider which one has been activated first. For (2), B is active before A, so A cannot impose on B. But A's accesses are valid while B is active, since A can override B. For no collision in the case of (2), we simply need to ensure that the last access of B occurs before A is activated.

For (3), B is activated before A, hence B can override A's accesses. For no collision, all of A's accesses cannot happen while B is active. Thus position (3) follows the "A is never accessed when B is active" rule in a general way. Local variables of a child function are in the position of (3). To illustrate, the local B can use the same name as local A and live in a child function or block scope if each time A is accessed, Lua sees A and not B. So we have to check all accesses of A and see whether they collide with the active period of B. If A is not accessed during that period, then B can be active with the same name.

The above appears to resolve all sorts of cases where the active times of A and B overlap. Note that in the above, the allocator does not need to know how locals are separated according to function prototypes. Perhaps the allocator can be simplified if knowledge of function structure is utilized. This scheme was implemented in a hurry in 2008 -- it could probably be simpler if Lua grammar is considered, but LuaSrcDiet mainly processes various index values in tables.

Name Allocation Algorithm

To begin with, the name generator is mostly separate from the name allocation algorithm. The name generator returns the next shortest name for the algorithm to apply to local variables. To attempt to reduce symbol entropy (which benefit compression algorithms), the name generator follows English frequent letter usage. There is also an option to calculate an actual symbol entropy table from the input data.

Since there are 53 one-character identifiers and (53*63-4) two-character identifiers (minus a few keywords), there isn't a pressing need to optimally maximize name reuse. The single-file version of LuaSrcDiet 0.12.0, at just over 3000 SLOC and 156KB in size, currently allocates around 55 unique local variable names.

In theory, we should need no more than 260 local identifiers by default. Why? Since LUAI_MAXVARS is 200 and LUAI_MAXUPVALUES is 60, at any block scope, there can be at most (LUAI_MAXVARS + LUAI_MAXUPVALUES) locals referenced, or 260. Also, those from outer scopes not referenced in inner scopes can reuse identifiers. The net effect of this is that a local variable name allocation method should not allocate more than 260 identifier names for locals.

The current algorithm is a simple first-come first-served scheme:

(a) One local object that use the most tokens is named first.
(b) Any other non-conflicting locals with respect to the first object are assigned the same name.
(c) Assigned locals are removed from consideration and the procedure is repeated for objects that have not been assigned new names.
(d) Steps (a) to (c) repeats until no local objects are left.
In addition, there are a few extra issues to take care of:

(e) Implicit "self" locals that have been flagged as such are already "assigned to" and so they are left unmodified.
(f) The name generator skips "self" to avoid conflicts. This is not optimal but it is unlikely a script will use so many local variables as to reach "self".
(g) Keywords are also skipped for the name generator.
(h) Global name conflict resolution.
For (h), global name conflict resolution is handled just after the new name is generated. The name can still be used for some locals even if it conflicts with other locals. To remove conflicts, global variable accesses for the particular identifier name is checked. Any local variables that are active when a global access is made is marked to be skipped. The rest of the local objects can then use that name.

The algorithm has additional code for handling locals that use the same name in the same scope. This extends the basic algorithm that was discussed earlier. For example:

local foo = 10 -- (1)
...
local foo = 20 -- (2)
...
print(e)

Since we are considering name visibility, the first 'foo' does not really cease to exist when the second 'foo' is declared, because if we were to make that assumption, and the first 'foo' is removed before (2), then I should be able to use 'e' as the name for the first 'foo' and after (2), it should not conflict with variables in the outer scope with the same name. To illustrate:

local e = 10 -- 'foo ' renamed to 'e'
...
local t = 20 -- error if we assumed 'e' removed here
...
print(e)

Since 'e' is a global in the example, we now have an error as the name as been taken over by a local. Thus, the first 'foo' local must have its active time extend to the end of the current scope. If there is no conflict between the first and second 'foo', the algorithm may still assign the same names to them.

The current fix to deal with the above chains local objects in order to find the removal position. It may be possible to handle this in a clean manner -- LuaSrcDiet handles it as a fix to the basic algorithm.


Ideas

The following is a list of optimization ideas that do not require heavy-duty source code parsing and comprehension.

Lexer-Based Optimization Ideas

A little desperate for a few bytes, can be done, but not real keen on implementing it.
For example, 65536 can be represented using 2^16, and so on. An expression must be evaluated in the same way, otherwise this seems unsafe.
Should we warn or "test and truncate"? Not really an optimization that will see much use.
Current recommendation is to use the HTML plugin to display globals in red. The developer can then visually analyze the source code and make the appropriate fixes. I think this is better than having the program guess the intentions of the developer.
For long strings, need to know user's intention. Would rather not implement.

Parser-Based Optimization Ideas

Heavy-duty optimizations will need more data to be generated by the parser. A full AST may eventually be needed. The most attractive idea that can be quickly implemented with a significant code size 'win' is to reduce the number of local keywords.

Need to consider unused locals in multiple assignments.
From:
-- separate locals
local foo
local bar
-- separate locals with assignments
local foo = 123
local bar = "pqr"
To:
-- merged locals
local foo,bar
-- merged locals with assignments
local foo,bar=123,"pqr"
From:
local foo, bar = nil, nil
To:
local foo,bar

How desirable is this? From Lua list discussions, it seems to be potentially unsafe unless all return locations are known and checked.
Yeah, this might save a few bytes.
Not sure if this is safe to do.
This is more suitable for an optimizing compiler project.

2011-09-13 khman