Using regex to filter a bunch of email addresses in text with some specific conditions

I'm experimenting with regex and I'm trying to pick out a bunch of email addresses that are embedded in some source text. The filter has two specific conditions:

  1. Every email starts with abc

  2. It follows the regular email pattern: an @, followed later by a ., and ending specifically in com

Source:

sajgvdaskdsdsds[email protected]sdksdhkshdsdk[email protected]wdgjkasdsdad

Pattern1 = "abc[\w\W]*[@][\w]*\.com"

code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args)
    {
        boolean found = false;
        String source = "[email protected]@gmail.comwdgjkasdsdad";


        String pattern1 = "abc[\\w\\W]*[@][\\w]*\\.com";

        Pattern p1 = Pattern.compile(pattern1);
        Matcher m1 = p1.matcher(source);
        System.out.println("Source:\t" + source);
        System.out.println("Exprsn:\t" + m1.pattern());
        while (m1.find())
        {
            found = true;
            System.out.println("Pos: " + m1.start() + "\tFound: " + m1.group());
        }
        System.out.println();
        if(!found)
        {
            System.out.println("Nothing found!");
        }

    }

}

I'm expecting the output to be:

Pos: 15 Found: [email protected]

Pos: 48 Found: [email protected]

But I'm getting:

Pos: 15 Found: [email protected]@gmail.com

If I use Pattern2, abc[\\w]*[@][\\w]*\\.com, I get the expected output. However, an email address can contain non-word characters after abc and before @ (for example: [email protected]).

Hence Pattern2 doesn't work with non-word characters, so I went with [\\w\\W]* instead of [\\w]*.

I also tried Pattern3, abc[\\w\\W]*[@][\\w]*\\.com[^.], and it still doesn't work.

Please help: what am I doing wrong?

There are 3 answers

5
Mad Physicist On BEST ANSWER

Regex quantifiers are greedy by default, meaning that they grab as much of the string as they can. [\w\W]* matches any character, including @, so it swallows every intervening @ and gives back only as much as it must: the match ends at the last @ that is followed by \w*\.com.

Either use the reluctant form of the operators (e.g. *? instead of *), or just simplify the expression:

abc[^@]*@[^.]+\.com

[^@]* takes as many characters that aren't @ as it can find. Similarly, [^.]+ matches everything up to the first dot.
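
To make the difference concrete, here is a minimal sketch that runs the greedy pattern from the question and the negated-class pattern over the same input. The sample text and its addresses are made up for illustration (the real ones are obscured in the question), and findAll is a hypothetical helper, not something from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsNegatedClass {

    // Hypothetical helper: collects every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up input: two placeholder addresses embedded in filler text.
        String text = "xxabc.one@example.comyyabc_two@example.comzz";

        // Greedy [\w\W]* runs past the first @ and stops at the last usable one.
        System.out.println(findAll("abc[\\w\\W]*@\\w*\\.com", text));
        // prints [abc.one@example.comyyabc_two@example.com]

        // [^@]* cannot cross an @, so each address is matched on its own.
        System.out.println(findAll("abc[^@]*@[^.]+\\.com", text));
        // prints [abc.one@example.com, abc_two@example.com]
    }
}
```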

Alternatively, you can use reluctant operators:

abc.*?@.*?\.com
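
As a rough sketch of the reluctant variant, the snippet below tries both the pattern above and the question's own pattern with only the first * changed to *?. The input is invented for illustration and findAll is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReluctantDemo {

    // Hypothetical helper: collects every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up input: two placeholder addresses embedded in filler text.
        String text = "xxabc.one@example.comyyabc_two@example.comzz";

        // Reluctant quantifiers expand one character at a time,
        // so each match stops at the first @ and the first ".com".
        System.out.println(findAll("abc.*?@.*?\\.com", text));
        // prints [abc.one@example.com, abc_two@example.com]

        // Smallest change to the question's pattern: [\w\W]* becomes [\w\W]*?
        System.out.println(findAll("abc[\\w\\W]*?@\\w*\\.com", text));
        // prints [abc.one@example.com, abc_two@example.com]
    }
}
```

One difference worth keeping in mind: by default . does not match line terminators, whereas [\w\W] does, so the two forms can behave differently on multi-line input.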
3
Andrey Tyukin On

Try to exclude '@' from the left part:

"abc[\\w\\W&&[^@]]+@[\\w]+\\.com"

Then in the following input:

"[email protected]" + 
"[email protected]" + 
"[email protected]"

it matches:

[email protected]
[email protected]
[email protected]

The [foo&&[^bar]] syntax in the regex means: include all foo, but exclude all bar.


EDIT: the pattern [\\w\\W&&[^@]] is slightly nonsensical, because it's the same as [^@]. However, if you want to restrict \\w\\W to something more meaningful, it would still work.
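
For reference, a small sketch of the intersection syntax in action. The input strings are made up, findAll is a hypothetical helper, and the consonant example is only there to show a case where the left-hand side is narrower than "everything".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IntersectionDemo {

    // Hypothetical helper: collects every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up input: two placeholder addresses embedded in filler text.
        String text = "xxabc.one@example.comyyabc_two@example.comzz";

        // "Anything (\w or \W), but not @" between abc and the @ sign.
        System.out.println(findAll("abc[\\w\\W&&[^@]]+@[\\w]+\\.com", text));
        // prints [abc.one@example.com, abc_two@example.com]

        // A narrower left-hand side: lowercase letters, minus the vowels.
        System.out.println(findAll("[a-z&&[^aeiou]]+", "regex"));
        // prints [r, g, x]
    }
}
```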

1
hugh On

In your first character class, \\w includes all word characters, [a-zA-Z_0-9]. \\W is just the complement of this, so putting the two together matches any character at all. Ideally you'd use a whitelist of the characters you expect in here (\n probably isn't allowed!), but the key thing is that you definitely don't want @, so this will give you the two matches:

"abc[^@]*[@][\\w]*\\.com"

I'd suggest that the other square brackets are superfluous and should be removed, and that the second group should really require at least one character. This would leave you with:

"abc[^@]*@\\w+\\.com"
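
A quick sketch of how these variants differ on edge cases. All inputs are invented, findAll is a hypothetical helper, and the whitelist class [\w.+\-] is only an assumption about which characters you might allow before the @, not a rule for real email addresses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FinalPatternDemo {

    // Hypothetical helper: collects every match of regex in text, in order.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String loose = "abc[^@]*[@][\\w]*\\.com";       // \w* allows an empty domain name
        String strict = "abc[^@]*@\\w+\\.com";          // \w+ needs at least one character
        String whitelist = "abc[\\w.+\\-]*@\\w+\\.com"; // assumed set of allowed characters

        // Made-up input: two placeholder addresses embedded in filler text.
        String text = "xxabc.one@example.comyyabc_two@example.comzz";
        System.out.println(findAll(strict, text));
        // prints [abc.one@example.com, abc_two@example.com]

        // An empty domain name gets through the loose pattern only.
        System.out.println(findAll(loose, "abc@.com"));  // prints [abc@.com]
        System.out.println(findAll(strict, "abc@.com")); // prints []

        // [^@]* accepts a space before the @; the whitelist does not.
        System.out.println(findAll(strict, "abc one@example.com"));    // prints [abc one@example.com]
        System.out.println(findAll(whitelist, "abc one@example.com")); // prints []
    }
}
```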