What does "context" mean in context-free and context-sensitive grammars?


If I have something like var string = "var";, then after the first double quote the rules change: the second var does not mean the same thing as the var at the beginning of the statement. After the second double quote things go back to normal. How is that not considered context?

(Please don't use those arrows in your answer, try a natural language instead!)


There are 2 answers

Answer by rici (accepted)

The direction of the arrow is important, so if I can't talk about it, it's going to be difficult to explain. So, sorry, I'm going to use arrows. They really aren't complicated.

The expression A -> ... means "an A is ...". It does not mean "... is an A". Context-free means that if A can be "..." in some context, it can be "..." in any context. But the arrow always points from category to specific; never backwards.

In your example, an identifier is a letter followed by a bunch of alphanumeric symbols:

    identifier -> letter (letter OR digit)...

So identifier could be var. That doesn't mean that var is always an identifier, as your example shows. The arrow points in one direction.

Because the grammar is context-free, if we are looking for an identifier in some context and we accept var as an identifier, then in any other context where we are looking for an identifier, we must also accept var.

But there are contexts (between quotes) where we are not looking for an identifier. That's fine; the context-free condition has not been broken. The context applies in the direction of the arrow.
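
To make that concrete, here is a minimal sketch in Python (the function names and regular expressions are mine, just for illustration, not part of any particular grammar formalism): the identifier rule behaves the same way wherever it is invoked, and between quotes it simply is never invoked.

    import re

    def identifier(text, pos):
        # identifier -> letter (letter OR digit)...  -- the same rule in every context
        m = re.match(r"[A-Za-z][A-Za-z0-9]*", text[pos:])
        return (m.group(0), pos + m.end()) if m else None

    def string_literal(text, pos):
        # string_literal -> '"' any-characters '"'  -- never looks for an identifier
        m = re.match(r'"[^"]*"', text[pos:])
        return (m.group(0), pos + m.end()) if m else None

    source = 'var string = "var";'
    print(identifier(source, 0))       # ('var', 3): var is accepted as an identifier here
    print(string_literal(source, 13))  # ('"var"', 18): the second var is just string content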

Answer by Osr Workshops

This is a good question because the answers are more subtle than maybe people are giving credit for -- in my opinion, anyhow -- even if we agree to use the word "context" specifically in the technical formal-language sense.

The first answer is good, but from a certain perspective I could see someone feeling it's circular or begs the question (perhaps because the technical definition of "context-free" is itself a little circular). After all, why can't a parser look at the second var and "say", in effect: here's a sequence that matches the identifier rule, so let's call it an identifier which, in this case, happens to be inside a string literal? That's not impossible: consider a language where identifiers start with a $ character and are parsed that way inside string literals too, eventually yielding a value substitution.

Sure, a grammar is context-free if all of its production rules can be stated without any contextual qualification. The var string = "var"; example illustrates the point insofar as the rule matching the first var as an identifier (i.e., the pattern which generates an identifier in the modeled language) does not itself mention any notion of context.

So, why can't a parser just start immediately before the second var and conclude that this second "var" string is also an identifier? If we're thinking in terms of context, we'd say that the parsing sequence up to that point establishes a context where the identifier rule is temporarily suspended. But that identifier rule does not have any way to declare which contexts it "belongs" to.

Presumably context-free rules do not need to describe or declare contexts where they are active because of how a parser would work through strings in the language. In this case, a parser would never "get to" the position immediately before the second var because (for instance) it would match the entire quoted string with one rule, starting at the open quote.

So the answer to my earlier question would be that in a context-free grammar a parser could, indeed, read the second var as an identifier. Except that it wouldn't, because it would never be in a state where it is trying to find rules to match that specific location in the input stream.
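
A rough tokenizer sketch (mine, in Python; the rule table is invented for this one example) of what "never getting there" looks like:

    import re

    TOKEN_RULES = [
        ("STRING", r'"[^"]*"'),               # one rule eats the entire quoted span
        ("IDENT",  r"[A-Za-z_][A-Za-z0-9_]*"),
        ("OP",     r"[=;]"),
        ("SPACE",  r"\s+"),
    ]

    def tokenize(text):
        pos = 0
        while pos < len(text):
            for name, pattern in TOKEN_RULES:
                m = re.match(pattern, text[pos:])
                if m:
                    if name != "SPACE":
                        yield name, m.group(0), pos
                    pos += m.end()
                    break
            else:
                raise SyntaxError(f"unexpected character at {pos}")

    for tok in tokenize('var string = "var";'):
        print(tok)
    # ('IDENT', 'var', 0), ('IDENT', 'string', 4), ('OP', '=', 11),
    # ('STRING', '"var"', 13), ('OP', ';', 18)
    # The scanner jumps from position 13 straight past the closing quote;
    # position 14 (the second var) is never a site where rules are tested.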

In other words, situations that conceptually seem to involve context are formally accounted for by engineering the rules so that locations where a context might seem to be in effect are subsumed in larger matches -- context is never a deciding factor (a source of ambiguity) between two different potential rule matches. Two competing rules never "collide" in such a way that keeping track of context is necessary to resolve which rule applies. This all relies on the fact that only certain positions in an input string will ever be sites where rules are tested -- the others are simply consumed by matches for a single rule.

Notice, however, that this is a specific notion of "parsing", based on parsers that scan input from beginning to end. Consider, instead, something like a syntax highlighter that seeks to make an educated guess about the nature of each input character without fully parsing an entire source-code file (and all of its includes, etc.). For instance, an ideal highlighter for Qt/C++ source files (this doesn't actually happen even in Qt's own IDE) would color the % in a QString placeholder like %2 differently from the % in an expression like x % 2 (testing whether x is even or odd), and differently again from surrounding text that has no special meaning. A grammar enabling this would infer the meaning of a % even if the parsing process started directly at that % character, rather than relying on rules being coordinated so that -- in this case -- the string-literal % is always consumed as part of a string-literal rule.
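
A crude, hypothetical sketch of that kind of local guessing (this is emphatically not how Qt Creator works, and the heuristic only looks at one line of context):

    def classify_percent(line, i):
        """Guess the role of line[i] == '%' from nearby characters only."""
        assert line[i] == "%"
        inside_string = line[:i].count('"') % 2 == 1   # odd number of quotes so far
        if inside_string and i + 1 < len(line) and line[i + 1].isdigit():
            return "QString placeholder (e.g. %2)"
        if not inside_string:
            return "modulo operator (e.g. x % 2)"
        return "plain text inside a string"

    print(classify_percent('label.setText(tr("%2 items"));', 18))  # placeholder
    print(classify_percent('if (x % 2 == 0) return;', 6))          # modulo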

Notice that a highlighter will, ideally, be reasonably accurate and help even while an input source is being modified, so that it might not be fully parseable in the first place. If I open a new file and position the cursor somewhere, a highlighter could -- again, best-case scenario -- start to observe constructions around the cursor's current location, while perhaps a different thread also launches a formal parse from the start of the file. We might want to analyze rules that are specific to the former case -- trying to process the "neighborhood" of some intermediate location in an input source (e.g., the current cursor location) via rules which cannot be assumed to match only in coordinated sequences from the input start.

In theory, that sort of "processing" of an input string involves a different notion of parsing, which would yield a different classification of grammars. For instance, we could ask about languages with the property that each input character can be assigned a specific type or "catcode" in isolation, without considering its neighbors and without "parsing" up to that point -- i.e., scanners could run in parallel, selecting any individual character, if desired. Imagine a language where identifiers (and only identifiers) use capital letters (and no other characters) whereas quoted strings only use lower-case letters (relying on unicode hex-numbers for any other symbols). In this case one could distinguish identifiers from string literals even when taking one character at random.
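
A toy sketch of that idea (the imagined language and the category names are made up): every character's category can be decided in isolation, in any order, even in parallel.

    def catcode(ch):
        if ch.isupper():
            return "part of an identifier"
        if ch.islower():
            return "part of a string literal"
        if ch == '"':
            return "string delimiter"
        return "other (punctuation, digits, whitespace)"

    for ch in 'SET GREETING = "hello";':
        print(ch, "->", catcode(ch))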

Obviously such a language might be too simple for practical use -- but it illustrates the point that languages can be evaluated based on how easy or difficult it is to infer details about input strings (again, consider syntax highlighting as a practical use case) by examining the neighborhood of any given string location, without relying on a top-down parser. In this sense string literals are simpler in languages where the only complication is escape sequences for quote marks inside the string, as compared to languages which allow arbitrary expressions in (perhaps special kinds of) strings (e.g., ${} substitutions in JavaScript template literals).

If we try to formalize this sense of some languages being more "ambiguous" than others, we might indeed find it useful to employ some notion of "context" that is more expansive than the one in ordinary formal grammars. Languages that are "context-free" vis-a-vis top-down parsers might have a more nuanced reliance on contexts in this alternative sense. In short: to what extent can rules be formulated such that they can be tested against an input site regardless of what other rules might have matched surrounding positions in the input string?

It is also true that one can perform tricks with context-free rules that entail some concept of context de facto even if not de jure. If your grammar is fed to a parser generator that emits C code, for instance, you can always employ some sort of "ambient" context (say, via global variables) which is consulted by the C code produced for a rule match even though it is not mentioned in the rule itself. Suppose we have a language where percent signs have a different meaning inside string literals (cf. Qt strings with .arg substitution) than outside (where, say, they represent a modulo operator). One way to model that situation is to make context explicit: have a "% inside string literals" rule which is distinct from a "% outside string literals" rule, and use contexts to activate one or the other (assume we can't just consume string literals in their entirety because of, e.g., the possibility of ${}-style embedded expressions). However, a context-free grammar could instead have a single rule matching % signs and use the code triggered by that rule to generate different output (or otherwise behave differently) by consulting a C variable that indicates whether we're currently inside a string literal or not.

Technically this is still a context-free grammar, because the "branching" to different productions/behaviors occurs within the code body that runs for any % match; the context does not figure in the rule itself.
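
A sketch of that trick (in Python rather than generated C, with invented names): a single rule fires for every %, and only its action consults the ambient flag.

    in_string_literal = False   # the "ambient" context, analogous to a C global

    def on_quote():
        global in_string_literal
        in_string_literal = not in_string_literal

    def on_percent():
        # The same rule fires for every '%'; only the action branches on the flag.
        if in_string_literal:
            return "emit: substitution placeholder"
        return "emit: modulo operator"

    for ch in 'x % 2; s = "%1 done";':
        if ch == '"':
            on_quote()
        elif ch == "%":
            print(on_percent())
    # -> emit: modulo operator
    # -> emit: substitution placeholder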

And yet obviously some sort of context is involved here, and simply relying on the dictum that context has no formal role so long as rules themselves do not "quantify" over contexts misses what's going on in these situations. Perhaps the best way to say it is that while grammars might be context-free, the languages modeled by some context-free grammars might actually be, in some substantial sense, context-sensitive. We might want to describe a language as "context-sensitive" if the most implementationally fortuitous and conceptually meaningful grammars for the language are context-sensitive, even if it is possible to construct a context-free grammar which does parse the language successfully.

Again, as a case in point, if I were writing a compiler for a language with ${}-style expressions embedded in string literals, I would probably opt for a context-sensitive grammar so that I could switch rules on and off in contexts like embedded expressions, rather than a context-free grammar. I would rather write one production to handle "% as a special character in string literals" and a different one for "% as a symbol for modulus" than merge the two into a single production with some kind of switch block or if/else branching. The compiler code in the former case would be more meaningful and maintainable, which is a different consideration than the mathematical classification of the language in terms of its formal grammar.
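
A sketch of the mode-switching style I have in mind (the rule tables are hypothetical; real lexer generators offer something similar, e.g. start conditions in flex): each % rule is active only in its own context.

    RULES = {
        "CODE": {
            "%": "modulo operator",
            '"': ("switch mode", "STRING"),
        },
        "STRING": {
            "%": "substitution placeholder",
            '"': ("switch mode", "CODE"),
        },
    }

    def scan(text):
        mode = "CODE"
        for ch in text:
            rule = RULES[mode].get(ch)
            if isinstance(rule, tuple):   # a mode switch rather than a token
                mode = rule[1]
            elif rule is not None:
                yield ch, rule

    print(list(scan('x % 2; s = "%1 done";')))
    # [('%', 'modulo operator'), ('%', 'substitution placeholder')]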

Frankly, I believe context-sensitive grammars have been underappreciated and underinvestigated in formal language theory, as well as in compiler and programming-language theory. I think computer scientists (and others) can be a little biased, thinking of "context-free" grammars basically as just describing "formal" languages -- systematic and unambiguous, unlike natural (human) language. On this assumption, if you're working with a computationally tractable language it should have a context-free grammar by definition, so talking about "context-free" is basically just a way to signal that we're talking about formal/computer (rather than natural) language. But that obscures the fact that formal languages can often be modeled more systematically via context-sensitive grammars even if it is theoretically possible to provide a context-free grammar instead.