A couple of months ago, researchers from Harvard University and Google did a study in which they mined the complete text of 4 percent of the world’s books and came up with interesting statistics about English vocabulary.
Has anyone done something similar for a programming language?
Yes, a similar analysis was done against a massive amount of code in multiple languages on GitHub: http://corte.si/posts/code/devsurvey/index.html
Also, on a smaller scale, the code analysis and code metrics tools used with most IDEs will provide that sort of analysis within a single codebase. They spit out interesting things like cyclomatic complexity, lines of code, etc., which are similar in a way. It's sort of like analyzing a single book instead of a library.
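To give a feel for what the "single book" version looks like, here is a minimal sketch in Python that counts keyword and identifier frequencies and non-blank, non-comment lines in one piece of Python source. It uses only the standard library (`tokenize`, `keyword`, `collections.Counter`). The function name `token_stats` and the `SAMPLE` snippet are made up for illustration, and this is not how the linked survey or any particular IDE tool works.

```python
import io
import keyword
import tokenize
from collections import Counter

# Token types that do not count as "code" for the line count.
_NON_CODE = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def token_stats(source: str) -> dict:
    """Return keyword counts, identifier counts, and a rough lines-of-code figure."""
    keywords = Counter()
    identifiers = Counter()
    code_lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _NON_CODE:
            continue
        # Record the line on which each code token starts.
        code_lines.add(tok.start[0])
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                keywords[tok.string] += 1
            else:
                identifiers[tok.string] += 1
    return {
        "keywords": keywords,
        "identifiers": identifiers,
        "lines_of_code": len(code_lines),
    }


# Hypothetical sample input, standing in for a real source file.
SAMPLE = '''
def add(a, b):
    # add two numbers
    if a is None:
        return b
    return a + b
'''

if __name__ == "__main__":
    stats = token_stats(SAMPLE)
    print(stats["keywords"].most_common(3))
    print(stats["identifiers"].most_common(3))
    print(stats["lines_of_code"])
```

Running the same function over every file in a repository and summing the counters would be the "library" version. The line count here is deliberately crude: a multi-line string only counts for the line it starts on.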