Decoding comment and ccontent in RFC 2822


I was working on a regular expression to validate email addresses and I'm getting hung up on a recursive level of quoted-string, specifically comment and ccontent. It seems to me that I'm unable to resolve comment, which references ccontent, because ccontent references comment. Can anyone set me straight?

ccontent        =       ctext / quoted-pair / comment

comment         =       "(" *([FWS] ccontent) [FWS] ")"

Just in case I'm missing something obvious, I'll explain the recursion from quoted-string.

quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]

Level 2:

CFWS            =       *([FWS] comment) (([FWS] comment) / FWS)

Level 3:

comment         =       "(" *([FWS] ccontent) [FWS] ")"

Level 4:

ccontent        =       ctext / quoted-pair / comment

There is 1 answer

Hossy On

I've been doing a lot of reading of the RFC and watching some YouTube videos on context-free grammars, and I believe that, for my use case, I've come to a reasonable conclusion. I would like input on it.

First, I'm starting with a simple production for an email address:

addr-spec       =       local-part "@" domain

where domain is clearly defined in RFC 1035 and I'll be focusing on local-part here. Per RFC 1035, domain can easily be represented in a regular expression as ^(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$ (use your choice of string start/end anchors).
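As a quick way to try the domain expression, here is a minimal Python sketch. The names DOMAIN_RE and is_valid_domain are made up for illustration, and the pattern is copied from the paragraph above with ^ and $ as the anchors.

```python
import re

# Domain pattern copied from the text, anchored with ^ and $.
# DOMAIN_RE and is_valid_domain are illustrative names, not from any RFC.
DOMAIN_RE = re.compile(
    r"^(?=.{0,255}$)"                          # overall length check
    r"[a-z][a-z0-9\-]{0,61}[a-z0-9]"           # first label
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"   # further labels
)

def is_valid_domain(name: str) -> bool:
    """Return True if the name matches the pattern (input must be lowercase)."""
    return DOMAIN_RE.match(name) is not None

print(is_valid_domain("mail.example.com"))   # True
print(is_valid_domain("-bad.example.com"))   # False, label starts with "-"
print(is_valid_domain("a.example.com"))      # False, see note below
```

One thing this makes visible: as written, the pattern needs at least two characters per label, so a single-letter label such as "a" is rejected even though the RFC 1035 label grammar allows it. Whether that matters depends on your use case.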

local-part      =       dot-atom / quoted-string / obs-local-part

local-part gets interesting because all three of the rules that can make it up contain folding white space (FWS) references.

dot-atom        =       [CFWS] dot-atom-text [CFWS]
quoted-string   =       [CFWS]
                        DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                        [CFWS]
obs-local-part  =       word *("." word)

Per section 2.2.3, the intention of including folding white space was to overcome the 998/78 character limits but not to allow the white space to become a part of the email address. Because my intention here is to construct a regular expression to process a string and validate its potential as a syntactically valid email address, FWS must be removed and I will therefore not be including it within the regular expression.

In addition, in sections 3.2.4 and 3.2.5, CFWS and the DQUOTEs, in the case of quoted-string, are to be specifically excluded for semantic evaluation. Because of this, I will also be excluding them from the regular expression.
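For completeness, the comment/ccontent recursion from the question is exactly the part a classic regular expression cannot express, which is another reason to handle comments outside the regex. If comments ever had to be skipped in the input first, a small recursive function would do. The following Python sketch is only an illustration under simplifying assumptions: skip_comment is a made-up name, FWS is not treated specially, and every character other than "(", ")" and "\" is accepted as ctext.

```python
def skip_comment(s: str, i: int) -> int:
    """Given s[i] == "(", return the index just past the matching ")".

    Simplified model of:
        comment  = "(" *([FWS] ccontent) [FWS] ")"
        ccontent = ctext / quoted-pair / comment
    Raises ValueError if the comment is malformed or unterminated.
    """
    if i >= len(s) or s[i] != "(":
        raise ValueError("expected '(' at index %d" % i)
    i += 1
    while i < len(s):
        ch = s[i]
        if ch == ")":
            return i + 1                 # end of this comment
        if ch == "(":
            i = skip_comment(s, i)       # ccontent = comment (the recursion)
        elif ch == "\\":
            if i + 1 >= len(s):
                raise ValueError("dangling backslash")
            i += 2                       # ccontent = quoted-pair
        else:
            i += 1                       # ccontent = ctext (simplified)
    raise ValueError("unterminated comment")

text = "(outer (inner) comment)john.doe"
print(text[skip_comment(text, 0):])      # john.doe
```

The mutual reference in the grammar simply becomes a function that calls itself for each nested "(".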

This simplifies things greatly and allows the construction of a strong regular expression for validating an email address. With these changes, I can now rewrite the three variables from local-part as follows:

dot-atom        =       dot-atom-text
quoted-string   =       qcontent
obs-local-part  =       word *("." word)

dot-atom is fairly quick and easy to break down...

dot-atom        =       dot-atom-text
dot-atom-text   =       1*atext *("." 1*atext)
atext           =       ALPHA / DIGIT / ; Any character except controls,
                        "!" / "#" /     ;  SP, and specials.
                        "$" / "%" /     ;  Used for atoms
                        "&" / "'" /
                        "*" / "+" /
                        "-" / "/" /
                        "=" / "?" /
                        "^" / "_" /
                        "`" / "{" /
                        "|" / "}" /
                        "~"

So we get dot-atom=[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*.
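Here is a small Python sketch to sanity-check that expression against the dot-atom-text rule. DOT_ATOM_RE and is_dot_atom are illustrative names, and re.fullmatch is used in place of explicit anchors.

```python
import re

# The dot-atom expression from the text, unchanged.
DOT_ATOM_RE = re.compile(
    r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"
    r"(?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*"
)

def is_dot_atom(s: str) -> bool:
    """Return True if the whole string matches the dot-atom expression."""
    return DOT_ATOM_RE.fullmatch(s) is not None

for candidate in ["john.doe", "user+tag", ".john", "john.", "john..doe"]:
    print(candidate, is_dot_atom(candidate))
```

The last three candidates are rejected because each "." must be followed by an atext character and the expression cannot start with a ".".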

The next two are rather curious: obs-local-part contains quoted-string through word, and because of how obs-local-part is written, we can treat quoted-string as redundant and exclude it altogether.

quoted-string   =       qcontent
obs-local-part  =       word *("." word)
word            =       atom / quoted-string

During the breakdown of obs-local-part, we come across a new case where we need to ignore something for semantic evaluation: quoted-pair. Section 3.2.2 tells us that the "\" character is semantically "invisible", so it will be excluded from the regular expression. So, the following RFC definitions...

atom            =       [CFWS] 1*atext [CFWS]
quoted-pair     =       ("\" text) / obs-qp
obs-qp          =       "\" (%d0-127)

become

atom            =       1*atext
quoted-pair     =       text / obs-qp
obs-qp          =       %d0-127

The breakdown of obs-local-part goes seven levels of recursion deep, but suffice it to say there is a shortcut that eliminates almost all the thinking here. As you may have noticed, obs-qp above contains all ASCII characters 0-127. quoted-pair can be obs-qp, qcontent can be quoted-pair, quoted-string is qcontent, and word can be quoted-string. Since the period or "full-stop" character is included in ASCII characters 0-127, we can simplify the definition of obs-local-part to be [\x00-\x7F]+.

Here are the definitions to support this statement:

obs-local-part  =       word *("." word)
word            =       atom / quoted-string

atom            =       1*atext
quoted-string   =       qcontent

qcontent        =       qtext / quoted-pair

qtext           =       NO-WS-CTL /     ; Non white space controls
                        %d33 /          ; The rest of the US-ASCII
                        %d35-91 /       ;  characters not including "\"
                        %d93-126        ;  or the quote character
quoted-pair     =       text / obs-qp

NO-WS-CTL       =       %d1-8 /         ; US-ASCII control characters
                        %d11 /          ;  that do not include the
                        %d12 /          ;  carriage return, line feed,
                        %d14-31 /       ;  and white space characters
                        %d127
text            =       %d1-9 /         ; Characters excluding CR and LF
                        %d11 /
                        %d12 /
                        %d14-127 /
                        obs-text
obs-qp          =       "\" (%d0-127)

obs-text        =       *LF *CR *(obs-char *LF *CR)

obs-char        =       %d0-9 / %d11 /          ; %d0-127 except CR and
                        %d12 / %d14-127         ;  LF

Coming full circle, let's revisit the original definitions:

addr-spec       =       local-part "@" domain
local-part      =       dot-atom / quoted-string / obs-local-part

and combine into one place the defined regular expressions:

domain          =   (?=.{0,255})[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*
dot-atom        =   [a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*
quoted-string   =   [\x00-\x7F]+
obs-local-part  =   [\x00-\x7F]+

One final piece to RFC 2822 is section 3.4.1 that states:

The locally interpreted string is either a quoted-string or a dot-atom. If the string can be represented as a dot-atom (that is, it contains no characters other than atext characters or "." surrounded by atext characters), then the dot-atom form SHOULD be used and the quoted-string form SHOULD NOT be used. Comments and folding white space SHOULD NOT be used around the "@" in the addr-spec.

Because of this (and the definitions of SHOULD and SHOULD NOT), we have two regular expressions we can use to validate an email address depending on how strict we want to be.

Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs

^[\x00-\x7F]+@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$

Option 2: Prefer the use of dot-atom

^[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
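To see how differently the two options behave, here is a minimal Python sketch that runs both against a few inputs. The names DOMAIN, ATEXT, OPTION_1, OPTION_2 and check are made up for this example, and the lower bound of the ASCII range is spelled \x00 because Python's re module requires two hex digits.

```python
import re

# Building blocks taken from the expressions in the text.
DOMAIN = (r"(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
          r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*")
ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

OPTION_1 = re.compile(r"^[\x00-\x7F]+@" + DOMAIN + r"$")
OPTION_2 = re.compile(r"^" + ATEXT + r"(?:\.?" + ATEXT + r")*@" + DOMAIN + r"$")

def check(addr: str) -> tuple:
    """Return (matches Option 1, matches Option 2) for one address."""
    return (OPTION_1.match(addr) is not None,
            OPTION_2.match(addr) is not None)

for addr in ["john.doe@example.com",
             "john doe@example.com",
             ".john@example.com",
             "a@b@example.com"]:
    print(addr, check(addr))
```

Only the first address passes both. The other three pass Option 1 alone, since any ASCII character (space, leading ".", even another "@") is allowed in its local-part.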

A note about RFC 5321...

The last thing I'm going to add here is that while RFC 2822 was written in April 2001 and provided no character limit to local-part, RFC 5321 (published in October 2008 alongside RFC 5322) defines a limit of 64 octets in section 4.5.3.1.1. So, we would rewrite the options above as:

Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs

^(?=[^@]{0,64}@)[\x00-\x7F]+@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$

Option 2: Prefer the use of dot-atom

^(?=[^@]{0,64}@)[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$
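A short Python sketch of the length check in Option 2 follows, using the made-up names OPTION_2_LIMITED and is_valid. It only exercises the boundary at 64 characters.

```python
import re

# Option 2 with the 64-character lookahead, split up for readability.
OPTION_2_LIMITED = re.compile(
    r"^(?=[^@]{0,64}@)"                                   # local-part length
    r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"
    r"(?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*"
    r"@(?=.{0,255}$)"                                     # domain length
    r"[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
)

def is_valid(addr: str) -> bool:
    """Return True if the address matches the length-limited Option 2."""
    return OPTION_2_LIMITED.match(addr) is not None

print(is_valid("a" * 64 + "@example.com"))   # True
print(is_valid("a" * 65 + "@example.com"))   # False
```

Note that the lookahead counts characters, which equals octets here only because the local-part is restricted to ASCII.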

Preference note...

I am one who believes that DNS names and email addresses should always be represented in lowercase, so my regular expressions are written this way. Domain names are compared case-insensitively; the local-part is formally case-sensitive (RFC 5321, section 2.4), although in practice mailbox names are almost always handled without regard to case (as far as I'm aware). For validation, you can either run the regular expression in case-insensitive mode or convert your string input to lowercase before validation.
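Both approaches can be sketched in Python as follows. PATTERN is Option 2 from above, and the two function names are made up for illustration. The re.ASCII flag is my own addition: without it, Python's case-insensitive matching lets [a-z] match a few non-ASCII letters as well.

```python
import re

# Option 2 (dot-atom local-part) from the text.
PATTERN = (
    r"^[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"
    r"(?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*"
    r"@(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
)

def validate_lowercased(addr: str) -> bool:
    """Lowercase the input first, then match case-sensitively."""
    return re.match(PATTERN, addr.lower()) is not None

def validate_ignorecase(addr: str) -> bool:
    """Match the input as given, in case-insensitive mode."""
    return re.match(PATTERN, addr, re.IGNORECASE | re.ASCII) is not None

print(validate_lowercased("John.Doe@Example.COM"))   # True
print(validate_ignorecase("John.Doe@Example.COM"))   # True
```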