Preface: This is for a school assignment. I'm not trying to harvest e-mails for malicious purposes.
I need to identify, extract, and transform email addresses from a given file (as a command-line argument). For an obfuscated email address, I need to transform the e-mail back to a regular email address format (account-name@domain-name).
These are the obfuscation techniques I need to account for:
No obfuscation. An email address may be included in a pair of <>. For example,
1. <[email protected]> or [email protected].
2. A space MAY be added before or after (or both) the @ sign.
3. The @ sign is written as AT or at, and space is added before and after AT or at.
4. The . sign in domain name is written as DOT or dot, and space is added before and after DOT
or dot.
Currently I am just trying to account for the first technique. "1. <[email protected]> or [email protected]."
Here is what I have so far:
EDITED: Using the help from @ikegami
#!/usr/bin/perl -w
use warnings;
use strict;
my @addrs;
my $re;
open my $INFILE, '<', $ARGV[0] or die $!;
while( my $line = <$INFILE> ) {
push @addrs, $line =~ /(\w+\@(?:\w+\.)*\w+)/g;
foreach $re (@addrs) {
if ($re =~ (/$line/)) {
print $re;
}
}
}
close $INFILE;
No longer getting an error, however getting no output.
Sample Input:
Email: <[email protected]> email: [email protected] [email protected]
Email: anonym3 AT efs.new.edu E-mail: anonym4 at efs.new.edu test at 9:00PM
We will have a test in room 705 @ another time.
Email: anonym5 @ efs dot new dot edu what if we continue
Another test anonym6 at efs dot new dot edu
If you type a dot, it means you have finished typing all contents.
Email:anonym7 AT new DOT efs DOT edu
We can, at 10:00PM, go to library DOT or .
My gmail address is jeff.anonym@gmail DOT com
The output should be:
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Any help/point in the right direction would be greatly appreciated!
The downvotes are probably related to the question's subject matter: You appear to be working toward a solution to the problem of subverting people's attempts to avoid having their email addresses harvested. People usually obfuscate their addresses because they don't want them harvested. So whatever solution you come up with can be used to violate the intent of those who posted obfuscated addresses. However, it's not your fault the professor chose an unpopular topic.
Moving past that....
Detecting obfuscated email addresses in text is an imperfect science, and somewhat of an arms race. Even detecting valid email addresses is tricky. There will be false positives, and possibly false negatives too. And even with a pretty good looking list of email addresses could still have addresses that may be technically correct but practically of no use.
The CPAN module Email::Address is good at parsing text for email addresses, but it comes up with false positives. Email::Valid is good at rejecting many of those false positives. Here's an example that produces the output you're looking for:
This produces the following output:
UPDATE:
Here is a (less robust, but working) solution that avoids the use of external CPAN modules, solving the core problem using regular expressions.
I used the
/x
modifier to allow the regular expression to be freeform (insignificant whitespace, and comments allowed). That allowed it to be broken into smaller more legible chunks.