I was working on a regular expression to validate email addresses and I'm getting hung up on the recursive part of quoted-string, specifically comment and ccontent. It seems to me that I can't resolve comment, which references ccontent, because ccontent references comment. Can anyone set me straight?
ccontent = ctext / quoted-pair / comment
comment = "(" *([FWS] ccontent) [FWS] ")"
Just in case I'm missing something obvious, I'll explain the recursion from quoted-string.
Level 1:
quoted-string = [CFWS]
                DQUOTE *([FWS] qcontent) [FWS] DQUOTE
                [CFWS]
Level 2:
CFWS = *([FWS] comment) (([FWS] comment) / FWS)
Level 3:
comment = "(" *([FWS] ccontent) [FWS] ")"
Level 4:
ccontent = ctext / quoted-pair / comment
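For what it's worth, the mutual recursion between comment and ccontent is legal in ABNF; it means comments can nest, which a classic (non-recursive) regular expression cannot match. Below is a minimal Python sketch of handling the nesting with a recursive function instead. The name skip_comment is hypothetical, and treating every character other than "(", ")" and "\" as ctext or FWS is a simplifying assumption rather than the RFC's exact ctext range.

```python
def skip_comment(s: str, i: int = 0) -> int:
    """Return the index just past the comment that starts at s[i].

    Raises ValueError if s[i] does not open a comment or the comment
    is never closed.
    """
    if i >= len(s) or s[i] != "(":
        raise ValueError(f"expected '(' at index {i}")
    i += 1
    while i < len(s):
        ch = s[i]
        if ch == ")":
            return i + 1              # end of this comment
        if ch == "(":
            i = skip_comment(s, i)    # ccontent -> comment (the recursion)
        elif ch == "\\":
            if i + 1 >= len(s):
                raise ValueError("dangling backslash")
            i += 2                    # ccontent -> quoted-pair
        else:
            i += 1                    # ccontent -> ctext, or FWS (simplified)
    raise ValueError("unterminated comment")


text = "(a (nested) comment) rest"
print(repr(text[skip_comment(text):]))
```

Each nested "(" simply re-enters the same function, which is all the comment/ccontent cycle in the grammar asks for.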
I've been reading the RFC closely and watching some YouTube videos on context-free grammars, and I believe I've come to a reasonable conclusion for my use case. I would like input on it.
First, I'm starting with a simple production for an email address:
where domain is clearly defined in RFC 1035 and I'll be focusing on local-part here. Per RFC 1035, domain can easily be represented in a regular expression as ^(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9](?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$ (use your choice of string start/end anchors).
local-part gets interesting because, of the three variables that can make it up, all contain folding white space (FWS) references.
Per section 2.2.3, the intention of including folding white space was to overcome the 998/78 character limits but not to allow the white space to become a part of the email address. Because my intention here is to construct a regular expression to process a string and validate its potential as a syntactically valid email address, FWS must be removed and I will therefore not be including it within the regular expression.
In addition, in sections 3.2.4 and 3.2.5, CFWS and the DQUOTEs, in the case of quoted-string, are to be specifically excluded for semantic evaluation. Because of this, I will also be excluding them from the regular expression.
This simplifies things greatly and allows the construction of a strong regular expression for validating an email address. With these changes, I can now rewrite the three variables from local-part as follows:
dot-atom is fairly quick and easy to break down...
So we get dot-atom =
[a-z0-9!#$%&'*+\-\/=?^_`{|}~](?:\.?[a-z0-9!#$%&'*+\-\/=?^_`{|}~])*
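As a quick sanity check of the dot-atom expression, here is a small Python sketch. The names ATEXT, DOT_ATOM and is_dot_atom are mine, introduced only for illustration, and fullmatch stands in for whatever start/end anchors you prefer.

```python
import re

# Character class copied from the dot-atom expression above (lowercase only).
ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

# One atext character, then any number of (optional dot + atext character).
DOT_ATOM = re.compile(ATEXT + r"(?:\.?" + ATEXT + r")*")


def is_dot_atom(local_part: str) -> bool:
    """True if the whole string matches the dot-atom expression."""
    return DOT_ATOM.fullmatch(local_part) is not None


for candidate in ["john.smith", "a+b_c", ".john", "john.", "jo..hn"]:
    print(candidate, is_dot_atom(candidate))
```

Because every dot must be followed by an atext character, leading, trailing and doubled dots are all rejected without any extra lookaheads.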
The next two are rather curious because obs-local-part contains quoted-string through word and, because of how obs-local-part is written, we can exclude quoted-string altogether as being redundant.
During the breakdown of obs-local-part, we come across a new case where we need to ignore something for semantic evaluation: quoted-pair. Looking at section 3.2.2, it instructs us that the "\" character is semantically "invisible" and will therefore be excluded from the regular expression. So, the following RFC definitions...
become
The breakdown of obs-local-part goes seven levels of recursion deep, but suffice to say there is a shortcut that eliminates almost all the thinking here. If you noticed, obs-qp above contains all ASCII characters 0-127. quoted-pair can be obs-qp, qcontent can be quoted-pair, quoted-string is qcontent, and word can be quoted-string. Since the period or "full-stop" character is included in ASCII characters 0-127, we can simplify the definition of obs-local-part to be
[\x00-\x7F]+
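A short Python sketch of what this class accepts. Whether a one-digit hex escape such as \x0 is allowed depends on the regex flavor; Python's re module requires exactly two hex digits, so the sketch uses \x00. The names OBS_LOCAL_PART and is_obs_local_part are hypothetical.

```python
import re

# A one-digit hex escape is rejected by Python's re module.
try:
    re.compile(r"[\x0-\x7F]+")
    short_escape_ok = True
except re.error:
    short_escape_ok = False

# Two-digit form: one or more characters in the ASCII range 0-127.
OBS_LOCAL_PART = re.compile(r"[\x00-\x7F]+")


def is_obs_local_part(local_part: str) -> bool:
    """True if the string is non-empty and consists only of ASCII 0-127."""
    return OBS_LOCAL_PART.fullmatch(local_part) is not None


print(short_escape_ok)
for candidate in ['john smith@"x"', "", "jos\u00e9"]:
    print(repr(candidate), is_obs_local_part(candidate))
```

The class is deliberately permissive: spaces, quotes and even "@" pass, and only an empty string or a non-ASCII character fails.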
Here are the definitions to support this statement:
Coming full circle, let's revisit the original definitions:
and combine into one place the defined regular expressions:
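As a sketch of that combination, the three expressions stated earlier (domain, dot-atom, obs-local-part) can be gathered into Python constants and applied to the two halves of an address. Splitting at the last "@" with rpartition, and the names PATTERNS and check, are my assumptions for illustration.

```python
import re

ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

PATTERNS = {
    # Note: as written, every label needs at least two characters.
    "domain": re.compile(
        r"^(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
        r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
    ),
    "dot-atom": re.compile(ATEXT + r"(?:\.?" + ATEXT + r")*"),
    "obs-local-part": re.compile(r"[\x00-\x7F]+"),
}


def check(address: str) -> dict:
    """Report which expression each half of the address satisfies."""
    local, _, domain = address.rpartition("@")  # assumption: split at last "@"
    return {
        "dot-atom": PATTERNS["dot-atom"].fullmatch(local) is not None,
        "obs-local-part": PATTERNS["obs-local-part"].fullmatch(local) is not None,
        "domain": PATTERNS["domain"].fullmatch(domain) is not None,
    }


print(check("john.smith@example.com"))
print(check("john smith@example.com"))
```

Reporting the three results separately keeps it visible which rule an address relies on, which matters once you decide how strictly to treat the obsolete form.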
One final piece of RFC 2822 is section 3.4.1, which states:
Because of this (and the definitions of SHOULD and SHOULD NOT), we have two regular expressions we can use to validate an email address depending on how strict we want to be.
Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs
Option 2: Prefer the use of dot-atom
A note about RFC 5321...
The last thing I'm going to add here is that while RFC 2822 was written in April 2001 and provided no character limit for local-part, RFC 5321 came around in October 2008 (alongside RFC 5322, which obsoletes RFC 2822) and defines a limit of 64 octets in section 4.5.3.1.1. So, we would rewrite the options above as:
Option 1: Very strict, ignore the SHOULDs and SHOULD NOTs
Option 2: Prefer the use of dot-atom
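As a rough illustration of the length limit only (not necessarily the exact expressions intended for the two options), a lookahead can cap a dot-atom local-part at 64 characters. Since the class is ASCII-only, characters and octets coincide here. The names LOCAL_PART_64 and is_valid_local_part are hypothetical.

```python
import re

ATEXT = r"[a-z0-9!#$%&'*+\-\/=?^_`{|}~]"

# (?=.{1,64}$) asserts 1 to 64 characters remain before the dot-atom
# expression itself is applied.
LOCAL_PART_64 = re.compile(
    r"(?=.{1,64}$)" + ATEXT + r"(?:\.?" + ATEXT + r")*"
)


def is_valid_local_part(local_part: str) -> bool:
    """True for a dot-atom local-part of at most 64 characters."""
    return LOCAL_PART_64.fullmatch(local_part) is not None


print(is_valid_local_part("a" * 64))
print(is_valid_local_part("a" * 65))
```

The same lookahead idea is already used for the 255-character cap in the domain expression, so the two halves stay consistent in style.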
Preference note...
I am one who believes that DNS and email addresses should always be represented in lowercase, so my regular expressions are written this way. For processing DNS and mailbox names, the case has always been ignored (as far as I'm aware). For validation, you can either run the regular expression in case-insensitive mode or convert your string input to lowercase before validation.
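Both approaches can be sketched in Python with the domain expression from earlier; the variable names are mine, and re.IGNORECASE is Python's spelling of case-insensitive mode.

```python
import re

DOMAIN = (
    r"^(?=.{0,255}$)[a-z][a-z0-9\-]{0,61}[a-z0-9]"
    r"(?:\.[a-z][a-z0-9\-]{0,61}[a-z0-9])*$"
)

strict = re.compile(DOMAIN)                  # lowercase input only
relaxed = re.compile(DOMAIN, re.IGNORECASE)  # case-insensitive mode

name = "Example.COM"
as_written = strict.fullmatch(name) is not None
case_insensitive = relaxed.fullmatch(name) is not None
lowered_first = strict.fullmatch(name.lower()) is not None

print(as_written, case_insensitive, lowered_first)
```

Lowercasing first has the side benefit that the value you store is already normalized.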