Regular expression transform obfuscated e-mail addresses Perl

754 views Asked by At

Preface: This is for a school assignment. I'm not trying to harvest e-mails for malicious purposes.

I need to identify, extract, and transform email addresses from a given file (as a command-line argument). For an obfuscated email address, I need to transform the e-mail back to a regular email address format (account-name@domain-name).

These are the obfuscation techniques I need to account for:

No obfuscation. An email address may be included in a pair of <>. For example, 
1. <[email protected]> or [email protected].
2. A space MAY be added before or after (or both) the @ sign.
3. The @ sign is written as AT or at, and space is added before and after AT or at. 
4. The . sign in domain name is written as DOT or dot, and space is added before and after DOT 
or dot.

Currently I am just trying to account for the first technique. "1. <[email protected]> or [email protected]."

Here is what I have so far:

EDITED: Using the help from @ikegami

#!/usr/bin/perl -w

use warnings;
use strict;

my @addrs;
my $re; 
open my $INFILE, '<', $ARGV[0] or die $!;

while( my $line = <$INFILE> ) {
     push @addrs, $line =~ /(\w+\@(?:\w+\.)*\w+)/g;

foreach $re (@addrs) {
     if ($re =~ (/$line/)) {
        print $re;
     }
}
}   


close $INFILE;

No longer getting an error, however getting no output.

Sample Input:

Email: <[email protected]> email: [email protected] [email protected]
Email: anonym3 AT efs.new.edu E-mail: anonym4 at efs.new.edu test at 9:00PM
We will have a test in room 705 @ another time.
Email: anonym5 @ efs dot new dot edu what if we continue
Another test anonym6 at efs dot new dot edu
If you type a dot, it means you have finished typing all contents.
Email:anonym7 AT new DOT efs DOT edu
We can, at 10:00PM, go to library DOT or .
My gmail address is jeff.anonym@gmail DOT com

The output should be:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Any help/point in the right direction would be greatly appreciated!

2

There are 2 answers

7
DavidO On BEST ANSWER

The downvotes are probably related to the question's subject matter: You appear to be working toward a solution to the problem of subverting people's attempts to avoid having their email addresses harvested. People usually obfuscate their addresses because they don't want them harvested. So whatever solution you come up with can be used to violate the intent of those who posted obfuscated addresses. However, it's not your fault the professor chose an unpopular topic.

Moving past that....

Detecting obfuscated email addresses in text is an imperfect science, and somewhat of an arms race. Even detecting valid email addresses is tricky. There will be false positives, and possibly false negatives too. And even with a pretty good looking list of email addresses could still have addresses that may be technically correct but practically of no use.

The CPAN module Email::Address is good at parsing text for email addresses, but it comes up with false positives. Email::Valid is good at rejecting many of those false positives. Here's an example that produces the output you're looking for:

use Email::Address;
use Email::Valid;

while( <DATA> ) {
  print "$_\n" for 
    grep { Email::Valid->address( $_->address ) }
      Email::Address->parse( fix_common_obfu($_) );
}

sub fix_common_obfu {
  my $string = shift;
  $string =~ s/\s+dot\s+/./ig;        # Turn ' dot ' into '.'
  $string =~ s/\s+@\s+|\s+at\s+/@/ig; # Turn ' @ ' or ' at ' into '@'.
  return $string;
}

__DATA__
Email: <[email protected]> email: [email protected] [email protected]
Email: anonym3 AT efs.new.edu E-mail: anonym4 at efs.new.edu test at 9:00PM
We will have a test in room 705 @ another time.
Email: anonym5 @ efs dot new dot edu what if we continue
Another test anonym6 at efs dot new dot edu
If you type a dot, it means you have finished typing all contents.
Email:anonym7 AT new DOT efs DOT edu
We can, at 10:00PM, go to library DOT or .
My gmail address is jeff.anonym@gmail DOT com

This produces the following output:

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

UPDATE:

Here is a (less robust, but working) solution that avoids the use of external CPAN modules, solving the core problem using regular expressions.

while( <DATA> ) {
  s/\s+@\s*|\s*@\s+/@/g;    # Deal with spaces around @.
  s/\s+at\s+/@/ig;          # Deal with " at "
  s/\s+dot\s+/./ig;         # Deal with " dot "
  while (
    m/(           # Match and capture.
      [^<\s:]+    # One or more characters except <, space, or :.
      @           # Literal '@'
      \S+         # One or more non-space.
      \.          # Require a dot in the domain.
      [^>\s]+     # One or more any character except whitespace or '>'.
    )/gx          # Repeat as many matches as found, allow freeform.
  ) {
    print "$1\n";
  }
}

__DATA__
.... etc etc etc...

I used the /x modifier to allow the regular expression to be freeform (insignificant whitespace, and comments allowed). That allowed it to be broken into smaller more legible chunks.

4
ikegami On

It's hard to explain what you did wrong because there's so much of it, so I'm just going to provide some correct code. This will scan for one of the types of emails:

my @addrs;
while ( my $line = <$INFILE> ) {
   push @addrs, $line =~ /(\w+\@(?:\w+\.)*\w+)/g;
}

or just

my @addrs;
while (<$INFILE>) {
   push @addrs, /(\w+\@(?:\w+\.)*\w+)/g;
}

Tip: Since you accept the file to read as an argument, you might as well use <>. Replace

open my $INFILE, '<', $ARGV[0] or die $!;
while (<$INFILE>) {
    ...
}

with

while (<>) {
    ...
}

It will read from the files named on the command line, or STDIN if no files were provided. So it's simpler and better!