MATLAB: plot a histogram showing the count of each character in a file


I have 400 files, each containing about 500000 characters, and those 500000 characters are drawn from only about 20 distinct letters. I want to make a histogram showing the 10 most frequently used letters (x-axis) and the number of times each letter is used (y-axis). How can I do this?


There are 2 answers

beaker

Since you have an array of uchar, you know that your elements will always be in the range 0:255. After seeing Tamás Szabó's answer here I realized that the null character is exceedingly unlikely in a text file, so I will just ignore it and use the range 1:255. If you expect to have null characters, you'll have to adjust the range.

In order to find the 10 most frequently-used letters, we'll first calculate the histogram counts, then sort them in descending order and take the first 10:

% Bin k counts the bytes equal to k, so a bin index is also a character code
counts = histc(uint8(part), 1:255);
[topCounts, topIndices] = sort(counts, 'descend');

Now we need to rearrange the counts and indices of those 10 so that the letters are back in character-code (alphabetical) order:

[sortedChars, shortIndices] = sort(topIndices(1:10));
sortedCounts = topCounts(shortIndices);
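
As a quick sanity check, the same steps can be run on a short string. This is only a sketch: the toy string and the variable N are assumptions for illustration, with N = 3 standing in for the 10 in the question.

```matlab
part = 'ccccaaabbd';                      % toy data: c x4, a x3, b x2, d x1
N = 3;                                    % how many letters to keep

counts = histc(double(part), 1:255);      % counts(k) = occurrences of char code k
[topCounts, topIndices] = sort(counts, 'descend');

% Keep the N most frequent codes, then put them back in code order
[sortedChars, shortIndices] = sort(topIndices(1:N));
sortedCounts = topCounts(shortIndices);

disp(char(sortedChars));                  % abc
disp(sortedCounts);                       % 3 2 4
```

The letter d is dropped because it is only the fourth most frequent, and the remaining letters come out in alphabetical order with their counts still attached.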

Now we can plot the histogram using bar:

bar(sortedCounts);

(You can add the 'hist' option if you want the bars in the graph touching like they do in the normal hist plot.)

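To change the x-axis tick labels from numeric values to characters, convert sortedChars from character codes to characters and use them as the 'XTickLabel':

labelChars = cellstr(char(sortedChars(:))).';   % char codes -> cell array of 1-char strings
set(gca, 'XTick', 1:numel(labelChars), 'XTickLabel', labelChars);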
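
The question mentions 400 files, so the counts presumably need to be summed over all of them before picking the top 10. A minimal sketch of that accumulation, assuming the files sit in a folder called data and match *.txt (both names are hypothetical):

```matlab
% Hypothetical folder and file pattern; adjust to where the files really are.
dataDir = 'data';
files = dir(fullfile(dataDir, '*.txt'));

counts = zeros(1, 255);                          % one bin per character code 1..255
for k = 1:numel(files)
    fid = fopen(fullfile(dataDir, files(k).name), 'r');
    part = fread(fid, inf, 'uint8=>double').';   % byte values as a double row vector
    fclose(fid);
    if ~isempty(part)
        counts = counts + histc(part, 1:255);    % null bytes (0) fall outside the bins
    end
end
```

The resulting counts vector can then go through the same sort, reorder, and bar steps as for a single file.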

Luis Mendo

Note: This answers the original version of the question (the data consists of 10 letters only; a histogram is wanted). The question was completely changed (the data consists of about 20 letters, and a histogram of the 10 most used letters is wanted).


If the ten letters are arbitrary and not known in advance, you can't use hist(..., 10). Consider the following example with three arbitrary "letters":

h = hist([1 2 2 10], 3);

The result is not [1 2 1], as you might expect, because hist divides the data range into equal-width bins instead of using one bin per distinct value.
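
To see where the values end up, it can help to ask hist for its bin centers as well. A small sketch on the same toy vector (the variable names x and centers are just illustrative):

```matlab
x = [1 2 2 10];
[h, centers] = hist(x, 3);   % three equal-width bins spanning min(x)..max(x)

disp(centers);               % 2.5  5.5  8.5
disp(h);                     % 3  0  1
```

The values 1, 2 and 2 all land in the first bin, which covers 1 to 4, so the distinct "letters" 1 and 2 are merged.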

Here are three approaches to do what you want:

  1. You can find the letters with unique and then do the sum with bsxfun:

    letters = unique(part(:)).';                 % the letters in your file
    h = sum(bsxfun(@eq, part(:), letters), 1);   % count occurrences of each letter
    
  2. The second line of the above approach could be replaced by histc specifying the bin edges:

    letters = unique(part(:)).';
    h = histc(part, letters);
    
  3. Or you could use sparse to do the accumulation:

    t = sparse(1, double(part), 1);   % values at repeated indices are summed
    [~, letters, h] = find(t);
    

As an example, for part = [1 2 2 10], any of the above approaches gives the expected result:

letters =
     1     2    10
h =
     1     2     1
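
For the updated question (keep only the most used letters), the letters and h vectors can be filtered afterwards. A hedged sketch, where the toy data and the names N, order and keep are assumptions, and N = 2 stands in for 10:

```matlab
part = [1 2 2 10 10 10 7];                    % toy data: 10 x3, 2 x2, 1 x1, 7 x1
letters = unique(part(:)).';                  % [1 2 7 10]
h = histc(part, letters);                     % [1 2 1 3]

N = 2;                                        % number of letters to keep
[~, order] = sort(h, 'descend');              % positions from most to least frequent
keep = sort(order(1:min(N, numel(order))));   % top N, back in letter order

topLetters = letters(keep);                   % [2 10]
topCounts  = h(keep);                         % [2 3]
```

Passing topCounts to bar, with topLetters as the tick labels, would then give the requested plot.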