Which runs first in bash? lexer, or expander?


I am trying to understand bash's parser and lexer mechanism. (My ultimate goal is to implement a bash-like shell.)

The first case

$ test='o a'
$ ech$test
a

(Edit: I removed the double quotes from the second line. This is my actual test case.)

The expander expanded the command and found the argument after expanding.
Expanded full command: echo a

So, I can assume that the lexer runs after the expansion operation, because bash understood that "echo a" is not a command name: "echo" is the command name, and "a" is an argument. (By the way, zsh doesn't do that.)

The second case

$ test="'"
$ echo $test
'

Echo prints only one single quote. However, if we expand this string to: echo ', it is not a valid command because it has an unclosed quote. So, I can assume two things:

  1. The lexer first works out what each token is, and expansion happens afterwards.

  2. Actually, the value of the 'test' variable is not one single quote. Its value is exactly: "'". So, in reality, we don't expand to echo ' we expand to echo "'", which is valid.

But the first assumption contradicts the conclusion I drew from the first test case. So, I assume the second one.

The third case

$ test="'"
$ echo "$test"
'

Echo prints only one single quote again. However, (I assume) it expanded this string to: echo ""'"", which is invalid because we have an unclosed quote.

So, my question is: "How does bash understand what I mean?"


There are 3 answers

2
John Bollinger (best answer)

The POSIX specification includes a detailed description of the behavior of the shell command language, including considerable detail about how it processes input. It begins with:

The shell shall read its input in terms of lines. (For details about how the shell reads its input, see the description of sh.) The input lines can be of unlimited length. These lines shall be parsed using two major modes: ordinary token recognition and processing of here-documents.

It continues from there with details of how lines are tokenized.

Only after a line has been tokenized can any kind of substitution or expansion be performed, because only at that point can the shell recognize where substitutions and expansions are called for.
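
One way to observe this ordering is to put an operator character inside a variable. The following is only a sketch, assuming the default IFS; the variable names cmd and out are made up for the illustration. Because ; enters the command line through expansion, after token recognition has already finished, it is never treated as a command separator:

```shell
# Assumes the default IFS (space, tab, newline).
cmd='echo hi; echo there'

# The command below is tokenized as the single word $cmd before any expansion.
# After expansion, field splitting produces four words: echo, hi;, echo, there.
# The ';' arrives too late to be recognized as a command separator, so the
# first word (echo) runs with the remaining three words as its arguments.
out=$($cmd)

printf '%s\n' "$out"   # prints: hi; echo there
```

If expansion ran before tokenization, two separate echo commands would run instead.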

The first case

$ test='o a'
$ ech"$test"
a

I don't believe you. I cannot reproduce that result in Bash 4.3, nor would I expect to. The POSIX specifications and the Bash manual are explicit and in agreement on this point: parameter expansions that occur inside double quotes are not subject to word splitting.

With respect to the order of command-line processing, what happens is that ech"$test" is recognized as a single token. Parameter expansion, field splitting, and quote removal apply to that token ("word" in shell jargon), with the overall result that it expands to a single word echo a, which, by virtue of its position in the fully-expanded command line, is interpreted as a command name.

It would be different if you instead did

$ ech$test

where the $test parameter expansion was not quoted. I suspect that's what you actually did in Bash to get the output a. In this case, the expansion of $test is not protected from word splitting, so after expanding the word to the (single) word echo a, that is split into two words at the space. The result is echo as command name and a as its argument.
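
The difference between the two forms can be checked by counting the words each one produces. This is a small sketch, assuming the default IFS; the helper variables (unquoted_count, quoted_count, and so on) are names invented for the illustration:

```shell
test='o a'   # a variable named test does not interfere with the test builtin

# Unquoted: the expansion result "echo a" undergoes field splitting.
set -- ech$test
unquoted_count=$#    # 2
unquoted_first=$1    # echo

# Quoted: the expansion result stays one word, "echo a".
set -- ech"$test"
quoted_count=$#      # 1

# Running the unquoted form executes echo with the argument a.
out=$(ech$test)      # a

# Running the quoted form looks for a command literally named "echo a".
# That lookup fails; bash uses status 127 for "command not found".
status=0
ech"$test" 2>/dev/null || status=$?

printf 'unquoted=%s quoted=%s out=%s status=%s\n' \
    "$unquoted_count" "$quoted_count" "$out" "$status"
```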

The second case

$ test="'"
$ echo $test
'

Yes, as the spec describes, quote characters are recognized during token recognition. Quote characters introduced into a command by parameter expansion are significant only as themselves.
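
For contrast, here is a sketch of what happens when the expanded text really is tokenized again, which is what eval does. The names plain and eval_status are invented for the illustration, and the eval runs in a subshell so that the syntax error cannot terminate the surrounding script:

```shell
test="'"

# Ordinary expansion: the quote character is plain data, never re-tokenized.
plain=$(echo $test)

# eval feeds the expanded text (echo followed by a lone single quote) back
# through the tokenizer, which now sees an unterminated quoted string and
# reports a syntax error.
eval_status=0
( eval "echo $test" ) 2>/dev/null || eval_status=$?

printf 'plain=%s eval_status=%s\n' "$plain" "$eval_status"
```

The non-zero eval_status is the failure the question expected from the plain command; it only appears when a second round of tokenization is explicitly requested.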

  2. Actually, the value of the 'test' variable is not one single quote.

Yes, it is: the value is exactly one single-quote character. This can be tested in a variety of ways, such as (your third case) echo "$test", or echo ${#test}, or (in bash) echo "${test:0:1}".
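
Those checks can be put together in a few lines. This is a minimal sketch; the variable names len, same, unquoted, and quoted are invented for the illustration:

```shell
test="'"

len=${#test}               # number of characters in the value

# Compare the value against a literal single-quote character.
if [ "$test" = "'" ]; then same=yes; else same=no; fi

unquoted=$(echo $test)     # the second case from the question
quoted=$(echo "$test")     # the third case from the question

printf 'length=%s same=%s unquoted=%s quoted=%s\n' \
    "$len" "$same" "$unquoted" "$quoted"
# prints: length=1 same=yes unquoted=' quoted='
```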

Its value is exactly: "'". So, in reality, we don't expand to echo ' we expand to echo "'", which is valid.

No, it isn't. See above for the actual explanation of the behavior you observed.

Overall, I strongly recommend relying first and foremost on the specifications for the behavior you want to implement. Experimenting is a fine way to try to clarify and solidify your interpretation of the specs, but it is a very unreliable way to determine the details of the required behavior.

8
KamilCuk

Which runs first in bash? lexer, or expander?

Lexer. The input has to be tokenized first. Expansion can only be performed after tokenization; however, the context in which a token appears (double quotes vs. here-document vs. backticks) determines which expansions apply. This is also documented at https://www.gnu.org/software/bash/manual/bash.html#Shell-Expansions :

Expansion is performed on the command line after it has been split into tokens.

In addition to the bash documentation above, you might also want to read https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_09_01 .
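
As a rough illustration of how the context recorded during tokenization decides what gets expanded, the sketch below compares a here-document with an unquoted delimiter, one with a quoted delimiter, and ordinary double-quoted and single-quoted words. The variable names are invented for the example:

```shell
name=world

# Unquoted delimiter: the body is subject to parameter expansion.
expanded=$(cat <<EOF
hello $name
EOF
)

# Quoted delimiter: the body is taken literally, so $name stays as typed.
literal=$(cat <<'EOF'
hello $name
EOF
)

# Ordinary words behave analogously with double versus single quotes.
dq="hello $name"
sq='hello $name'

printf '%s\n' "$expanded" "$literal" "$dq" "$sq"
# prints:
#   hello world
#   hello $name
#   hello world
#   hello $name
```

In each case the decision to expand or not was already fixed by how the tokenizer classified the text, before any expansion ran.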

2
Verpous

The short answer, as others have said, is the lexer. I just want to share an interesting bug I once had. I had a script like this:

foo() {
    shopt -s extglob
    # Code which uses extended globbing...
}

foo

It didn't work and it took me a long time to understand why. The reason was that extglob is a feature of the lexer, not the expander. By the time the function is called, its body has already been tokenized, hence activating extglob inside it won't change anything. So this is one experiment to answer your question.
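
A way to reproduce this without risking a syntax error in your own script is to run the snippets in child bash processes. This is a sketch under a few assumptions: bash is on PATH, +O extglob is used to start the child with the option explicitly off, and the pattern +(a|b) and the status variable names are placeholders chosen for the illustration:

```shell
# Case 1: extglob is switched on inside the function body. The whole function
# definition is parsed before shopt runs, so +(a|b) is a syntax error.
late_status=0
bash +O extglob -c 'foo() { shopt -s extglob; echo +(a|b); }; foo' \
    >/dev/null 2>&1 || late_status=$?

# Case 2: extglob is switched on by an earlier line at top level. That line
# is parsed and executed before the function definition is read.
fixed_status=0
bash +O extglob -c 'shopt -s extglob
foo() { echo +(a|b); }
foo' >/dev/null 2>&1 || fixed_status=$?

printf 'enabled inside the function: status %s\n' "$late_status"
printf 'enabled on an earlier line:  status %s\n' "$fixed_status"
```

The first status should be non-zero and the second zero, which matches the fix I ended up with: put shopt -s extglob on its own line before the function is defined.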