I am using listadmin to manage many mailman-based mailing lists. I have a long list of subjects and from addresses set up to block spam. Recently, I have been receiving smarter spam, in the sense that it uses nice-looking Unicode characters, e.g.:
Subject: Al l the ad ult mov ies you' ve see n a r e nothing c ompari- ng t o our exx xci t i ng compilation of 13' 000 mov ies in HD t hat are a v ailable for y ou now!
or
Subject: HD qua lit y vi d eos an d pho to graph s o f ho t c hic ks
are here for u
Now I want to use a smart Perl regex to block that. Piping these subjects to hexdump revealed many characters are a FULLWIDTH LATIN SMALL LETTER. However, \p{FULLWIDTH LATIN SMALL LETTER} doesn't work: Can't find Unicode property definition "FULLWIDTH LATIN SMALL LETTER"
So the question is: Is there a \p{something} to match those fullwidth characters? Alternatively: is there another way to match those characters?
The perlunicode page documents the available Unicode character classes. I found it referenced from perlrebackslash, which documents special character classes and backslash sequences such as \p{...} in regexes.
The summary is that all but the most common property classes require a property type and a property value, separated by : or =. However, fullwidth characters do not seem to be mentioned as a predefined property.
But there is the Block/Blk property, which can take Halfwidth and Fullwidth Forms (U+FF00–U+FFEF) as its value, i.e. \p{Block=Halfwidth and Fullwidth Forms}. This will match on your input (tested on v16.3).
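As a minimal sketch of how the block property could be used to flag such subjects: the helper name has_fullwidth and the sample subjects below are made up for illustration, and the property is spelled with underscores, which names the same block. It assumes the subject has already been decoded into a Perl character string (not raw MIME-encoded bytes).

```perl
use strict;
use warnings;

# The sample subject contains wide characters, so encode output as UTF-8.
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: returns 1 if the (already decoded) subject contains
# any character from the Halfwidth and Fullwidth Forms block (U+FF00-U+FFEF).
sub has_fullwidth {
    my ($subject) = @_;
    return $subject =~ /\p{Block=Halfwidth_And_Fullwidth_Forms}/ ? 1 : 0;
}

# Made-up sample subjects; the first mixes in fullwidth "q", "u" and "a".
my @subjects = (
    "HD \x{FF51}\x{FF55}\x{FF41}lity videos",
    "Weekly list digest",
);

for my $subject (@subjects) {
    printf "%-6s %s\n", (has_fullwidth($subject) ? 'BLOCK' : 'ok'), $subject;
}
```

Note that this flags every character in the block, including halfwidth Katakana and fullwidth punctuation, which may or may not be what you want for lists with legitimate East Asian traffic.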
A useful tool for this is uniprops. Its output shows that \p{Block=Halfwidth and Fullwidth Forms} can also be written \p{In Halfwidth and Fullwidth Forms}.
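If uniprops is not installed, a rough substitute (not the same tool, and far less detailed) is the core Unicode::UCD module, which can report the name and block of a code point. The sketch below uses its charblock and charinfo functions; the helper names block_of and name_of are made up for illustration.

```perl
use strict;
use warnings;
use Unicode::UCD qw(charblock charinfo);

# Hypothetical helper: block name for a single character.
sub block_of {
    my ($char) = @_;
    return charblock(ord $char);
}

# Hypothetical helper: Unicode character name, or undef if unknown.
sub name_of {
    my ($char) = @_;
    my $info = charinfo(ord $char);
    return $info ? $info->{name} : undef;
}

# Compare an ASCII "a" with its fullwidth counterpart U+FF41.
for my $char ("a", "\x{FF41}") {
    printf "U+%04X  %s  (%s)\n",
        ord($char), name_of($char) // '?', block_of($char) // '?';
}
```

The block name reported for a suspicious character is what goes after Block= in the \p{...} expression.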