Perl Regex to Exclude Certain TLDs for Spamassassin


I am not at all able to code in Perl; so, what seems like a simple thing -- writing a regex to score all URIs that are not for "com" or "net" or "org" TLDs -- is apparently beyond my skills. Could someone kindly enlighten me?

As an example I want https://foo.com.us/asdf?qwerty=123 to match and ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2 to not match.


There are 2 answers

Borodin On BEST ANSWER

The regex pattern

//(?:[a-z]+\.)*+(?!com/|net/|org/)

should do what you want. The slashes are part of the pattern, not delimiters. The possessive quantifier *+ requires Perl 5.10 or later.

Here's a demonstration

use strict;
use warnings;
use 5.010;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

for ( @urls ) {
    say m{//(?:[a-z]+\.)*+(?!com/|net/|org/)} ? 'match' : 'no match';
}

output

match
no match
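
To make the pattern easier to read, here is a sketch of the same regex written with the /x modifier so that each part can carry a comment. The names $other_tld and has_other_tld are made up for this illustration and are not part of SpamAssassin or the answer above.

```perl
use strict;
use warnings;
use 5.010;    # possessive quantifiers such as *+ need Perl 5.10 or later

# Same pattern as in the answer, spread out with /x so each part can be annotated.
my $other_tld = qr{
    //                        # the double slash that ends the scheme, as in "https://"
    (?: [a-z]+ \. )*+         # every lowercase label followed by a dot; possessive, so nothing is given back
    (?! com/ | net/ | org/ )  # what comes next, the last label, must not be com, net or org
}x;

# Hypothetical helper: returns 1 when the pattern matches the URL, 0 otherwise.
sub has_other_tld {
    my ($url) = @_;
    return $url =~ $other_tld ? 1 : 0;
}

for my $url (qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
}) {
    say has_other_tld($url) ? 'match' : 'no match';
}
```

Two limits follow from how the pattern is written: the lookahead expects a slash after the host name, so http://example.com with no trailing slash still matches, and [a-z] does not cover digits, hyphens or uppercase letters, so a host such as my-site.org also matches.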
Borodin On

You should use the URI module to separate the host name from the rest of the URL.

This example extracts only the final dot-separated label of the host name, so it will look at, say, uk from bbc.co.uk rather than co.uk, but it should serve your purpose.

use strict;
use warnings;

use URI;

my @urls = qw{
    https://foo.com.us/asdf?qwerty=123
    ftp://madeup.kernel.org/path/to/some/tarball.tar.bz2
};

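for my $url ( @urls ) {
    my $uri  = URI->new($url);    # parse into a separate object; leave @urls untouched
    my $host = $uri->host;
    my ($tld) = $host =~ /([^.]+)\z/;    # last dot-separated label of the host name

    if ( $tld !~ /^(?:com|net|org)\z/ ) {
        # non-standard TLD
    }
}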
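
As a small illustration of the remark about bbc.co.uk, the sketch below applies the same final-label extraction to bare host names, without the URI module. The helper names final_label and is_nonstandard_tld are hypothetical, and lowercasing the host first is an added assumption (host names are case-insensitive), not something the answer above does.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the last dot-separated label of a host name,
# lowercased, or undef when the host has no such label.
sub final_label {
    my ($host) = @_;
    my ($label) = lc($host) =~ /([^.]+)\z/;
    return $label;
}

# Hypothetical helper: returns 1 when the final label is not com, net or org.
sub is_nonstandard_tld {
    my ($host) = @_;
    my $tld = final_label($host);
    return 0 unless defined $tld;    # assumption: treat an unparsable host as "no hit"
    return $tld !~ /\A(?:com|net|org)\z/ ? 1 : 0;
}

for my $host (qw{ foo.com.us madeup.kernel.org bbc.co.uk }) {
    printf "%-20s %-4s %s\n", $host, final_label($host),
        is_nonstandard_tld($host) ? 'non-standard' : 'standard';
}
```

Only the last label is inspected here, so bbc.co.uk is judged by uk alone; telling co.uk apart from uk would need a list of registered suffixes, which this sketch does not attempt.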