Email Id validation according to RFC5322 and https://en.wikipedia.org/wiki/Email_address


Validating E-mail Ids according to RFC5322 and following

https://en.wikipedia.org/wiki/Email_address

Below is sample code using Java and a regular expression to validate e-mail IDs.

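import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void checkValid() {
    List<String> emails = new ArrayList<>();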
    //Valid Email Ids
    emails.add("[email protected]");
    emails.add("[email protected]");                   
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("[email protected]");
    emails.add("carlosd'[email protected]");
    emails.add("[email protected]");
    emails.add("admin@mailserver1");
    emails.add("[email protected]");
    emails.add("\" \"@example.org");
    emails.add("\"john..doe\"@example.org");

    //Invalid emails Ids
    emails.add("Abc.example.com");
    emails.add("A@b@[email protected]");
    emails.add("a\"b(c)d,e:f;g<h>i[j\\k][email protected]");
    emails.add("just\"not\"[email protected]");
    emails.add("this is\"not\\[email protected]");
    emails.add("this\\ still\"not\\[email protected]");
    emails.add("1234567890123456789012345678901234567890123456789012345678901234+x@example.com");
    emails.add("[email protected]");
    emails.add("[email protected]");

    String regex = "^[a-zA-Z0-9_!#$%&'*+/=? \\\"`{|}~^.-]+@[a-zA-Z0-9.-]+$";

    Pattern pattern = Pattern.compile(regex);
    int i=0;
    for(String email : emails){
        Matcher matcher = pattern.matcher(email);
        System.out.println(++i +"."+email +" : "+ matcher.matches());
    }
}

Actual Output:

   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   [email protected] : true
   9.carlosd'[email protected] : true
   [email protected] : true
   11.admin@mailserver1 : true
   [email protected] : true
   13." "@example.org : true
   14."john..doe"@example.org : true
   15.Abc.example.com : false
   16.A@b@[email protected] : false
   17.a"b(c)d,e:f;g<h>i[j\k][email protected] : false
   18.just"not"[email protected] : true
   19.this is"not\[email protected] : false
   20.this\ still"not\[email protected] : false
   21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : true
   [email protected] : true
   [email protected] : true

Expected Output:

[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
[email protected] : true
9.carlosd'[email protected] : true
[email protected] : true
11.admin@mailserver1 : true
[email protected] : true
13." "@example.org : true
14."john..doe"@example.org : true
15.Abc.example.com : false
16.A@b@[email protected] : false
17.a"b(c)d,e:f;g<h>i[j\k][email protected] : false
18.just"not"[email protected] : false
19.this is"not\[email protected] : false
20.this\ still"not\[email protected] : false
21.1234567890123456789012345678901234567890123456789012345678901234+x@example.com : false
[email protected] : false
[email protected] : false

How can I change my regular expression so that it rejects the following email IDs?

1234567890123456789012345678901234567890123456789012345678901234+x@example.com
[email protected]
[email protected] 
just"not"[email protected]

Below are the criteria for the regular expression:

Local-part

The local-part of the email address may use any of these ASCII characters:

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9;
  3. special characters !#$%&'*+-/=?^_`{|}~
  4. dot ., provided that it is not the first or last character unless quoted, and provided also that it does not appear consecutively unless quoted (e.g. [email protected] is not allowed but "John..Doe"@example.com is allowed);
  5. space and "(),:;<>@[\] characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash); comments are allowed with parentheses at either end of the local-part; e.g. john.smith(comment)@example.com and (comment)[email protected] are both equivalent to [email protected].

Domain

  1. uppercase and lowercase Latin letters A to Z and a to z;
  2. digits 0 to 9, provided that top-level domain names are not all-numeric;
  3. hyphen -, provided that it is not the first or last character. Comments are allowed in the domain as well as in the local-part; for example, john.smith@(comment)example.com and [email protected](comment) are equivalent to [email protected].

There are 3 answers

4
AudioBubble On BEST ANSWER

You could validate against RFC5322 like this
( reference regex modified )

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"  

https://regex101.com/r/ObS3QZ/1

 # (?im)^(?=.{1,64}@)(?:("[^"\\]*(?:\\.[^"\\]*)*"@)|((?:[0-9a-z](?:\.(?!\.)|[-!#\$%&'\*\+/=\?\^`\{\}\|~\w])*)?[0-9a-z]@))(?=.{1,255}$)(?:(\[(?:\d{1,3}\.){3}\d{1,3}\])|((?:(?=.{1,63}\.)[0-9a-z][-\w]*[0-9a-z]*\.)+[a-z0-9][\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\w]*))$

 # Note - remove all comments '(comments)' before running this regex
 # Find  \([^)]*\)  replace with nothing

 (?im)                                     # Case insensitive
 ^                                         # BOS

                                           # Local part
 (?= .{1,64} @ )                           # 64 max chars
 (?:
      (                                         # (1 start), Quoted
           " [^"\\]* 
           (?: \\ . [^"\\]* )*
           "
           @
      )                                         # (1 end)
   |                                          # or, 
      (                                         # (2 start), Non-quoted
           (?:
                [0-9a-z] 
                (?:
                     \.
                     (?! \. )
                  |                                          # or, 
                     [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
                )*
           )?
           [0-9a-z] 
           @
      )                                         # (2 end)
 )
                                           # Domain part
 (?= .{1,255} $ )                          # 255 max chars
 (?:
      (                                         # (3 start), IP
           \[
           (?: \d{1,3} \. ){3}
           \d{1,3} \]
      )                                         # (3 end)
   |                                          # or,   
      (                                         # (4 start), Others
           (?:                                       # Labels (63 max chars each)
                (?= .{1,63} \. )
                [0-9a-z] [-\w]* [0-9a-z]* 
                \.
           )+
           [a-z0-9] [\-a-z0-9]{0,22} [a-z0-9] 
      )                                         # (4 end)
   |                                          # or,
      (                                         # (5 start), Localdomain
           (?= .{1,63} $ )
           [0-9a-z] [-\w]* 
      )                                         # (5 end)
 )
 $                                         # EOS
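As a rough illustration of how the pattern above might be used from Java, here is a minimal sketch. It copies the answer's pattern as a string literal, strips parenthesised comments first (as the note above suggests), and then requires the whole input to match. The class name, method names, and sample addresses are hypothetical and chosen only for this example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; only the pattern itself comes from the answer.
class Rfc5322RegexSketch {

    // The answer's pattern, split across lines for readability only.
    private static final Pattern EMAIL = Pattern.compile(
        "(?im)^(?=.{1,64}@)"
        + "(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)"
        + "|((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@))"
        + "(?=.{1,255}$)"
        + "(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])"
        + "|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])"
        + "|((?=.{1,63}$)[0-9a-z][-\\w]*))$");

    // Removes "(comment)" runs, i.e. find \([^)]*\) and replace with nothing.
    static String stripComments(String address) {
        return address.replaceAll("\\([^)]*\\)", "");
    }

    // True when the comment-free address matches the pattern in full.
    static boolean isValid(String address) {
        return EMAIL.matcher(stripComments(address)).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "john.smith@example.com",
            "john.smith(comment)@example.com",
            "\"john..doe\"@example.org",
            "john..doe@example.com",
            "Abc.example.com"
        };
        for (String s : samples) {
            System.out.println(s + " : " + isValid(s));
        }
    }
}
```

Using matches() rather than find() means the whole string has to be a single address, which is what the test harness in the question does as well.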

How can I make [email protected] a valid email ID? – Mihir Feb 7 at 9:34

I think the spec wants the local part either to be enclosed in quotes
or to begin and end with [0-9a-z].

But, to get around the latter and make [email protected] valid, just
replace group 2 with this:

      (                             # (2 start), Non-quoted
           [0-9a-z] 
           (?:
                \.
                (?! \. )
             |                              # or, 
                [-!#\$%&'\*\+/=\?\^`\{\}\|~\w] 
           )*
           @

      )                             # (2 end)

New regex

"(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@))(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])|((?=.{1,63}$)[0-9a-z][-\\w]*))$"

New demo

https://regex101.com/r/ObS3QZ/5
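To see what the change to group 2 does in practice, the two patterns can be compared side by side. The sketch below is only an illustration: it assembles both regexes from the shared parts plus the two group 2 variants, and the class name, method names, and test addresses (such as user-@example.com) are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical comparison harness; the regex fragments are taken from the answer.
class Group2ComparisonSketch {

    // Everything before group 2 (length lookahead and the quoted alternative).
    private static final String HEAD =
        "(?im)^(?=.{1,64}@)(?:(\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"@)|";

    // Original group 2: the local part must end with [0-9a-z].
    private static final String STRICT_GROUP2 =
        "((?:[0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*)?[0-9a-z]@)";

    // Replacement group 2: only the first character must be [0-9a-z].
    private static final String RELAXED_GROUP2 =
        "([0-9a-z](?:\\.(?!\\.)|[-!#\\$%&'\\*\\+/=\\?\\^`\\{\\}\\|~\\w])*@)";

    // Everything after group 2 (the domain part), unchanged between versions.
    private static final String TAIL =
        ")(?=.{1,255}$)(?:(\\[(?:\\d{1,3}\\.){3}\\d{1,3}\\])"
        + "|((?:(?=.{1,63}\\.)[0-9a-z][-\\w]*[0-9a-z]*\\.)+[a-z0-9][\\-a-z0-9]{0,22}[a-z0-9])"
        + "|((?=.{1,63}$)[0-9a-z][-\\w]*))$";

    private static final Pattern STRICT = Pattern.compile(HEAD + STRICT_GROUP2 + TAIL);
    private static final Pattern RELAXED = Pattern.compile(HEAD + RELAXED_GROUP2 + TAIL);

    static boolean strict(String address) {
        return STRICT.matcher(address).matches();
    }

    static boolean relaxed(String address) {
        return RELAXED.matcher(address).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"user-@example.com", "john.@example.com", "john..doe@example.com"};
        for (String s : samples) {
            System.out.println(s + " : strict=" + strict(s) + ", relaxed=" + relaxed(s));
        }
    }
}
```

One side effect worth knowing about: because the relaxed group no longer requires a trailing [0-9a-z], a local part that ends in a single dot (john.@example.com) also passes, even though the criteria in the question disallow it.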

2
JackPGreen On

It's not the question you asked, but why re-invent the wheel?

Apache commons has a class that covers this already.

org.apache.commons.validator.routines.EmailValidator.getInstance().isValid(email)

This way you aren't responsible for keeping up to date with changing email format standards.

2
coladict On

A regular expression is the most difficult and error-prone way to validate email addresses. If you are using an implementation of javax.mail to send the emails, then the simplest way to determine whether an address will work is to use the provided parser: whether or not the address is compliant, if the library cannot use it, it doesn't matter.

// requires: import javax.mail.internet.InternetAddress;
public static boolean validateEmail(String address) {
    try {
        // if this fails, the mail library can't send emails to this address
        InternetAddress ia = new InternetAddress(address, true);
        // reject group addresses and addresses that start with routing information
        return !ia.isGroup() && ia.getAddress().charAt(0) != '@';
    }
    catch (Throwable t) {
        return false;
    }
}

Invoking the constructor with false instead of true turns off strict parsing, which allows addresses without an @domain part. The checkAddress function it invokes internally is private, so we can't just call checkAddress(addr,false,true) to exclude routing information (a feature practically designed for fraud through server bouncing). Instead, we have to check the first character of the validated address.

Now what you may notice here is that this validation method is actually compliant with RFC 2822, rather than 5322. The reason is that unless you are implementing your own SMTP sender library, you're using one that depends on this one, and if you have an address that is 5322-valid but 2822-invalid, then your 5322 validation is rendered useless. But if you are implementing your own 5322 SMTP library, then you should learn from the existing ones and write a parser function, rather than a regular expression.
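To give a feel for what a parser-style check looks like compared with a regular expression, here is a minimal sketch that covers only the unquoted local-part rules listed in the question (allowed characters, dot placement, and a 64-character limit). It is not a full RFC parser: quoted strings, comments, and domain validation are deliberately left out, and all class and method names are hypothetical.

```java
// Hypothetical sketch: checks only an unquoted local-part, nothing else.
class LocalPartCheckSketch {

    // Special characters allowed unquoted, as listed in the question.
    private static final String SPECIALS = "!#$%&'*+-/=?^_`{|}~";

    static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    static boolean isValidUnquotedLocalPart(String local) {
        // Length limit of 64 characters, and the part must not be empty.
        if (local.isEmpty() || local.length() > 64) {
            return false;
        }
        // A dot may not be the first or last character.
        if (local.charAt(0) == '.' || local.charAt(local.length() - 1) == '.') {
            return false;
        }
        char previous = 0;
        for (int i = 0; i < local.length(); i++) {
            char c = local.charAt(i);
            if (c == '.') {
                // Two consecutive dots are not allowed outside quotes.
                if (previous == '.') {
                    return false;
                }
            } else if (!isAsciiLetterOrDigit(c) && SPECIALS.indexOf(c) < 0) {
                return false;
            }
            previous = c;
        }
        return true;
    }

    // Splits at the last '@' and checks the local part; the domain is only
    // required to be non-empty here (an assumption made to keep this short).
    static boolean hasValidUnquotedLocalPart(String address) {
        int at = address.lastIndexOf('@');
        if (at <= 0 || at == address.length() - 1) {
            return false;
        }
        return isValidUnquotedLocalPart(address.substring(0, at));
    }

    public static void main(String[] args) {
        System.out.println(hasValidUnquotedLocalPart("john.smith@example.com"));
        System.out.println(hasValidUnquotedLocalPart("john..doe@example.com"));
    }
}
```

Each rule sits on its own line with its own comment, which makes it easier to adjust one rule without disturbing the others; that is the main practical argument for a parser function over a single large pattern.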