I recently started running a text-only Usenet server. Each news article (post) is stored as a file in a directory layout like this:
/var/spool/news/articles/alt/i/love/cat/1 2 3
The numbers are the individual articles. The server has been running for only a few hours and I already have 150M+ of articles (so Usenet is obviously not as dead as expected). I would like to start harvesting spammers so I can add them to my spam filter proactively. One common trait of spammers is that they tend to post to lots of newsgroups at once. Each message header contains a line like this:
Newsgroups: a,b,c
Each letter stands for a different newsgroup. I would like to run a daily report that tells me which articles have more than 3 commas in the Newsgroups line of the header. Here's what I've come up with so far:
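# List regular files with numeric names only, so directories are skipped.
find /var/spool/news/articles/ -type f -name '[0-9]*' > list.txt &&
# Print each path, then the comma count of its first Newsgroups line.
while IFS= read -r i; do echo "$i"; grep -m1 '^Newsgroups:' "$i" | tr -cd ',' | wc -c; done < list.txt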
The find command creates a list of every message, minus any directories. The while loop displays the message path and then, on a separate line, the number of commas.
Ideally I would like the output to be something like this:
/var/spool/news/articles/alt/i/love/cat/1 2
I could then put the output into a spreadsheet and sort by the number of commas, but the line break between the path and the number is messing me up. Sorry for the long post, but I figured I would explain what I'm doing in case someone else tries to do something like this in the future.
I would also appreciate it if anyone has any suggestions for doing this in a more "sane" manner than I have.
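For reference, here is a minimal sketch of the kind of one-line-per-article report described above. It assumes awk is available, that articles are stored with plain LF line endings, and that the Newsgroups header is not folded across several lines. The function name crosspost_report and its two parameters (spool directory, comma threshold) are made up for illustration and are not anything provided by the news server:

```shell
# Hypothetical helper: print "path comma_count" for every article whose
# Newsgroups header contains more than $2 commas (default 3).
crosspost_report() {
    spool=${1:-/var/spool/news/articles}
    min=${2:-3}
    # Article files have numeric names; -type f skips the group directories.
    find "$spool" -type f -name '[0-9]*' -exec awk -v min="$min" '
        FNR == 1      { inhdr = 1 }   # first line of each file: headers begin
        inhdr && /^$/ { inhdr = 0 }   # blank line: headers end, body begins
        inhdr && tolower($0) ~ /^newsgroups:/ {
            n = gsub(/,/, ",")        # gsub returns how many commas it matched
            if (n > min) print FILENAME, n
            inhdr = 0                 # ignore the rest of this article
        }
    ' {} + | sort -k2,2nr             # most heavily cross-posted first
}
```

Calling crosspost_report with no arguments would scan /var/spool/news/articles and list only the articles with more than 3 commas, path and count on the same line. Only the header section is inspected, so a quoted "Newsgroups:" line in a message body should not be counted. Because awk handles many files per invocation, this should also be much faster than spawning grep once per article.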