What is the meaning of "equivalence class" in the context of regular expressions?

876 views Asked by At

In certain flavours of regular expression, inside a square bracket expression, the = symbol is a special character that is used as a delimiter for enclosing any one the elements within an equivalence class. The documentation says the following:

An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. Only primary equivalence classes shall be recognized. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". If the collating element does not belong to an equivalence class, the equivalence class expression shall be treated as a collating symbol.

I am not quite sure what this means. If a, à and â belong to the same equivalence class, does that mean we wish to specify that the regular expressions "[ab]", "[àb]" and "[âb]" are equivalent? Then what is the purpose of using the [= =] delimiter, since we might as well write "[aàâb]"?

I understand what an "equivalence class" means in its general definition, but I am unable to grasp its meaning in this context.

1

There are 1 answers

5
Enlico On

Essentially, for example, [=a=] means "all the characters that belong to the equivalence class which a belongs to. If a and à form an equivalence class which only contains those two, then [=a=] and [=à=] are both the same as (obviously written inside a […]). But if the equivalence class also contains â, then all of [=a=], [=à=], and [=â=] mean aàâ (again, written in […]).

What characters an equivalence class contains depends on the locale, but if the locale doesn't define a given equivalence class, let's say [=a=], then the collating sequence with the same name will be used, i.e. [.a.], which normally is just the same as a, as the locales usually include normal characters as collating sequences.

Reference: Mastering Regular Expressions, 3rd Edition, page 128, which is an excellent book on regular expression, written by someone who knows regex down to the smallest bit.