Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."
The gist of his method seemed to be that the number of unique characters in a string serves as a measure of its entropy (code taken from PieterG's answer):
int uniqueCharacterCount = string.Distinct().Count();
I don't understand how the unique character count signifies the entropy of a string, or how the entropy of a string signifies low quality. Could someone with more knowledge in this area explain what Mr. Atwood is trying to accomplish?
Thanks!
The string 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy and is essentially meaningless.
The string 'blah blah blah blah blah blah blah blah' has slightly higher entropy, but it is still rather silly and could be part of an attack.
A post or comment whose entropy is comparable to that of these strings is probably not appropriate: it can't contain any meaningful message, not even a spam link. Such a post can simply be filtered out, or it can trigger an additional CAPTCHA.
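To make the comparison concrete, here is a minimal sketch in Python (not the C# used on the site) that computes both the distinct-character count from the question and the Shannon entropy in bits per character for the two example strings. The function names distinct_chars and shannon_entropy are hypothetical and chosen for illustration; this is not Stack Overflow's actual code.

```python
import math
from collections import Counter


def distinct_chars(text: str) -> int:
    """Number of unique characters, the measure quoted in the question."""
    return len(set(text))


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from empirical character frequencies."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # Sum of p * log2(1/p) over every distinct character.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# The two example strings from the answer above.
samples = ["a" * 27, " ".join(["blah"] * 8)]

for sample in samples:
    print(distinct_chars(sample), round(shannon_entropy(sample), 2), repr(sample))
```

The two measures are related: a string with k distinct characters can have at most log2(k) bits of entropy per character, so a very small distinct count guarantees low entropy. The first string has one distinct character and zero entropy; the second has five distinct characters, so its entropy cannot exceed log2(5), which is about 2.32 bits. That is why the cheap distinct-character count can stand in as a rough proxy when flagging posts like these.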