I would like to build yet another spam detector for my CMS. I currently see three options:
- use a simple PHP class and store tokens in MySQL
- install SpamAssassin and use a PHP connector
- something big like Mahout
I do not like the MySQL approach, because I fear the token table will grow very large over time and degrade the performance of the whole system. The SpamAssassin approach seems more attractive, but everywhere on the internet people write that SA's rules are focused on mail and mail headers, so it is not an ideal fit for CMS content. Last but not least, I am aware of Mahout, but I fear it might be too big and create a lot of administration overhead.
Is there something nice, small and efficient that could run on a Linux server and be accessed from PHP?
The simplest approach would be storing the tokens in MySQL, but I don't know how well that works.
If you want to classify text into spam/not-spam categories, I think Mahout is a good choice. It is built for big data and thus requires a Hadoop setup if you want map/reduce, but there is also a lightweight alternative you could probably use: the logistic regression algorithm in Mahout.
There is a ModelSerializer class with which you can store your trained model in binary format on your hard disk or somewhere else, so you don't have to set up Hadoop.
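To make the logistic regression idea concrete, here is a minimal sketch in plain Python. It is not Mahout's Java API, only an illustration of the technique: words are hashed into a fixed-size feature space and the weights are fitted with stochastic gradient descent. `NUM_FEATURES`, the function names and the four training comments are assumptions made up for this example.

```python
import math
import re
import zlib

NUM_FEATURES = 2 ** 16  # assumed size of the hashed feature space


def features(text):
    """Map lower-cased word tokens to feature indices (hashing trick)."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [zlib.crc32(t.encode("utf-8")) % NUM_FEATURES for t in tokens]


def predict(weights, bias, text):
    """Return the estimated probability that the text is spam."""
    score = bias + sum(weights.get(i, 0.0) for i in features(text))
    score = max(-30.0, min(30.0, score))  # keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-score))


def train(samples, epochs=20, learning_rate=0.1):
    """Fit sparse weights with plain stochastic gradient descent on log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, is_spam in samples:
            error = (1.0 if is_spam else 0.0) - predict(weights, bias, text)
            bias += learning_rate * error
            for i in features(text):
                weights[i] = weights.get(i, 0.0) + learning_rate * error
    return weights, bias


# Made-up training data, far too small for real use.
samples = [
    ("Buy cheap pills now, limited offer", True),
    ("Win money fast, click this link", True),
    ("Thanks for the detailed article on caching", False),
    ("I disagree with your point about indexes", False),
]
weights, bias = train(samples)
```

The model is just a sparse weight table plus a bias, so its size is bounded by `NUM_FEATURES` no matter how many comments it has seen. That is what makes this approach lightweight compared with a full Hadoop setup.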
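The same train-once, load-later idea can be sketched in Python. This is not Mahout's ModelSerializer and it writes JSON rather than a binary format; it only assumes the trained model is a dictionary of integer feature indices to weights plus a bias. The function names and the file name in the usage note are hypothetical.

```python
import json
import os
import tempfile


def save_model(path, weights, bias):
    """Write the model via a temp file so readers never see a partial file."""
    payload = {"bias": bias, "weights": {str(k): v for k, v in weights.items()}}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(payload, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_model(path):
    """Read the model back; JSON object keys are strings, so convert to int."""
    with open(path, encoding="utf-8") as handle:
        payload = json.load(handle)
    weights = {int(k): v for k, v in payload["weights"].items()}
    return weights, payload["bias"]
```

A training job could call `save_model("spam-model.json", weights, bias)` once, and the process that serves requests would only ever call `load_model`. Writing to a temporary file and renaming it means the model can be retrained while the classifier is in use.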
You could try:
There is the following class you could use as a code example for your problem:
Here are some more resources regarding Mahout on the web.
To access this from PHP, you could build a small RESTful web service in Java, or simply a command-line interface.
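As a rough illustration of the command-line variant, here is a sketch of what such a tool's contract could look like: text in, one line of JSON out. It is written in Python instead of Java to stay short, and the token weights, bias, threshold and script name are hypothetical placeholders for a real trained model.

```python
import json
import math
import re
import sys

# Hypothetical hand-set weights; a real tool would load a trained model.
WEIGHTS = {"cheap": 1.5, "pills": 2.0, "win": 1.0, "thanks": -1.5, "article": -1.0}
BIAS = -0.5
THRESHOLD = 0.5


def classify(text):
    """Score the text with a logistic model and return a JSON-friendly dict."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    score = BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)
    probability = 1.0 / (1.0 + math.exp(-score))
    return {"spam": probability >= THRESHOLD, "probability": round(probability, 4)}


def read_text(arg):
    """Read the text from a file path, or from standard input if arg is '-'."""
    if arg == "-":
        return sys.stdin.read()
    with open(arg, encoding="utf-8") as handle:
        return handle.read()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python classify.py comment.txt   (or "-" for standard input)
    print(json.dumps(classify(read_text(sys.argv[1]))))
```

On the PHP side, the script could be started with `proc_open`, the comment written to its standard input, and the single output line decoded with `json_decode`. Passing the text through standard input or a file rather than as a shell argument avoids quoting problems with user-supplied content.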
Hope this helps a little bit.