How does the entropy of a string of English text signify low quality?


Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."

The gist of his method seemed to be that counting the number of unique characters in a string signifies its entropy (code taken from PieterG's answer):

int uniqueCharacterCount = input.Distinct().Count(); // 'input' is the string being scored; the original snippet used 'string', which is a reserved word in C#
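For illustration (my own example, not from the post), LINQ's Distinct() treats the string as a sequence of characters:

using System.Linq;

int a = "aaaa".Distinct().Count(); // 1: one distinct character
int b = "blah".Distinct().Count(); // 4: four distinct characters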

I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.

Thanks!

There are 5 answers

Answered by 9000 (best answer)

String 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy, and is rather meaningless.

String 'blah blah blah blah blah blah blah blah' has somewhat higher entropy, but is still rather silly and could be part of an attack.

A post or a comment with entropy comparable to these strings is probably not appropriate; it can't contain any meaningful message, not even a spam link. Such a post can just be filtered out, or it can warrant an additional CAPTCHA.
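To put rough numbers on these examples, here is a minimal sketch (my own illustration, assuming a simple character-frequency model; it is not Stack Overflow's actual code) that computes per-character Shannon entropy:

using System;
using System.Linq;

class EntropyDemo
{
    // Shannon entropy in bits per character, treating the string as a
    // bag of characters and using observed frequencies as probabilities.
    static double BitsPerCharacter(string s) =>
        s.GroupBy(c => c)
         .Select(g => (double)g.Count() / s.Length)
         .Sum(p => -p * Math.Log(p, 2));

    static void Main()
    {
        // One distinct character: every position is fully predictable.
        Console.WriteLine(BitsPerCharacter("aaaaaaaaaaaaaaaaaaaaaaaaaaa")); // 0 bits

        // Five distinct characters, heavily repeated: still quite low.
        Console.WriteLine(BitsPerCharacter("blah blah blah blah blah blah blah blah")); // ~2.3 bits
    }
}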

Answered by Shekhar_Pro

Not exactly an answer to your question, but Wikipedia has this explanation of entropy:

Entropy is a measure of disorder, or more precisely unpredictability. For example, a series of coin tosses with a fair coin has maximum entropy, since there is no way to predict what will come next. A string of coin tosses with a two-headed coin has zero entropy, since the coin will always come up heads. Most collections of data in the real world lie somewhere in between.

English text has fairly low entropy. In other words, it is fairly predictable. Even if we don't know exactly what is going to come next, we can be fairly certain that, for example, there will be many more e's than z's, or that the combination 'qu' will be much more common than any other combination with a 'q' in it and the combination 'th' will be more common than any of them. Uncompressed, English text has about one bit of entropy for each byte (eight bits) of message.
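The coin-toss examples in the quote are easy to verify with the binary entropy formula H(p) = -p*log2(p) - (1-p)*log2(1-p). The snippet below is my own worked example, not part of the quoted material:

using System;

class CoinEntropy
{
    // Binary (Bernoulli) entropy in bits for a coin that lands heads
    // with probability p; the degenerate cases carry no uncertainty.
    static double H(double p) =>
        (p <= 0 || p >= 1) ? 0.0 : -p * Math.Log(p, 2) - (1 - p) * Math.Log(1 - p, 2);

    static void Main()
    {
        Console.WriteLine(H(0.5)); // fair coin: 1 bit, maximum entropy
        Console.WriteLine(H(1.0)); // two-headed coin: 0 bits
    }
}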

Answered by Rex M

Let's look at the Wikipedia entry on Entropy (information theory):

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message...

And specifically regarding English text:

The entropy rate of English text is between 1.0 and 1.5 bits per letter, or as low as 0.6 to 1.3 bits per letter, according to estimates by Shannon based on human experiments.

In other words, it's not simply that low entropy is bad and high entropy is good, or vice versa - there is an optimal entropy range.
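To make the "optimal range" idea concrete, here is a minimal sketch (my own illustration; the thresholds are placeholders, not anything from Stack Overflow). One caveat: a naive character-frequency estimate ignores dependencies between letters, so ordinary English prose scores roughly 4 bits per character by this measure rather than Shannon's 1.0 to 1.5, and any real thresholds would need empirical calibration:

using System;
using System.Linq;

class EntropyRangeCheck
{
    // Per-character Shannon entropy from observed character frequencies.
    static double BitsPerCharacter(string s) =>
        s.GroupBy(c => c)
         .Select(g => (double)g.Count() / s.Length)
         .Sum(p => -p * Math.Log(p, 2));

    // Hypothetical band: flag strings that look too repetitive or too
    // close to random noise. These numbers are illustrative only.
    static bool LooksSuspicious(string s)
    {
        double bits = BitsPerCharacter(s);
        return bits < 2.0 || bits > 5.5;
    }

    static void Main()
    {
        Console.WriteLine(LooksSuspicious("FUUUUUUUU------")); // True: ~1.3 bits
        Console.WriteLine(LooksSuspicious("The quick brown fox jumps over the lazy dog.")); // False: ~4.4 bits
    }
}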

Answered by BlueRaja - Danny Pflughoeft

The confusion seems to stem from the idea that this is used to block posts from being posted - it's not.

It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.

The idea is to catch posts like ~~~~~~No.~~~~~~ or FUUUUUUUU------, not to catch all low-quality posts.


As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.

See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926

Answered by Ilmarinen

The Shannon entropy H(P) is a property of a probability distribution P of a random variable X.

In the case of a string, a rudimentary way of treating it is as a bag of characters, in which case the character frequency counts provide an approximation of the probability distribution P of a randomly chosen character in the string.

If we simply count the number of unique characters in a string, that count corresponds, via log2, to the entropy of a uniform distribution over the unique characters that appear in that string: the greater the number of unique characters, the greater that entropy (see the sketch below).
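Here is a sketch of the contrast (my own illustration): log2 of the unique-character count is the entropy of that uniform distribution, and it is an upper bound on the estimate based on actual frequencies.

using System;
using System.Linq;

class UniqueVsFrequency
{
    // Entropy if each distinct character were equally likely (uniform).
    static double UniformBound(string s) =>
        Math.Log(s.Distinct().Count(), 2);

    // Entropy using the actual character frequencies.
    static double FrequencyEntropy(string s) =>
        s.GroupBy(c => c)
         .Select(g => (double)g.Count() / s.Length)
         .Sum(p => -p * Math.Log(p, 2));

    static void Main()
    {
        // "aabbbbbb" has 2 distinct characters, but they are far from
        // equally likely, so the frequency-based estimate is lower.
        Console.WriteLine(UniformBound("aabbbbbb"));     // 1.0 bit
        Console.WriteLine(FrequencyEntropy("aabbbbbb")); // ~0.81 bits
    }
}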

However, Jeff Atwood's (and BlueRaja's) subsequent code contributions are better measures, as they take into account the other possible distributions that a string, still thought of as a bag of (not necessarily unique) characters, can represent.

Building on Rex M's answer ... it would make more sense to look for strings whose 'character entropy' falls outside the 1.0 to 1.5 bits-per-letter range, as possible 'low quality strings.'