How to use the Percentile Function for Ranking purpose? (Matlab)

708 views Asked by At

I have the following 606 x 274 table: see here


Goal:

For every date calculate lower and upper 20% percentiles and, based on the outcome, create 2 new variables, e.g. 'L' for "lower" and 'U' for "upper", which contain the ticker names as seen in the header of the table.

Step by step:

% Replace NaNs with 'empty' for the percentile calculation (error: input to be cell array)
     T(cellfun(@isnan,T)) = {[]}
% Change date format
     T.Date=[datetime(T.Date, 'InputFormat', 'eee dd-MMM-yyyy')];
% Take row by row 
     for row=1:606
% If Value is in upper 20% percentile create new variable 'U' that contains the according ticker names.
% If Value is in lower 20% percentile create new variable 'L' that contains the according ticker names.
     end;

So far, experimenting with 'prctile' only yielded a numeric outcome, for a single column. Example:

Y = prctile(T.A2AIM,20,2);

Thanks for your help and ideas!

1

There are 1 answers

2
lucianopaz On BEST ANSWER

Generally speaking, if you have an array of numbers:

a = [4 2 1 8 -2];

percentiles can be computed by first sorting the array and then attempting to access the index supplied in the percentile. So prctile(a,20)'s functionality could in principle be replaced by

b = sort(a);
ind = round(length(b)*20/100);
if ind==0
    ind = 1;
end
b = b(ind);
% b = -2

However, prctile does a bit more of fancy magic interpolating the input vector to get a value that is less affected by array size. However, you can use the idea above to find the percentile splitting columns. If you chose to do it like I said above, what you want to do to get the headers that correspond to the 20% and 80% percentiles is to loop through the rows, remove the NaNs, get the indeces of the sort on the remaining values and get the particular index of the 20% or 80% percentile. Regrettably, I have an old version of Matlab that does not support tables so I couldn't verify if the header names are returned correctly, but the idea should be clear.

L = cell(size(T,1),1);
U = cell(size(T,1),1);
for row=1:size(T,1)
    row_values = T{row,:};
    row_values = row_values(2:end); % Remove date column
    non_nan_indeces = find(~isnan(row_values));
    if not(isempty(non_nan_indeces))
        [row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
        % The +1 is because we removed the date column
        L_ind = non_nan_indeces(sorted_indeces(1:round(0.2*length(row_values))))+1;
        U_ind = non_nan_indeces(sorted_indeces(round(0.8*length(row_values)):end))+1;
        % I am unsure about this part
        L{row} = T.Properties.VariableNames(L_ind);
        U{row} = T.Properties.VariableNames(U_ind);
    else
        L{row} = nan;
        U{row} = nan;
    end
end;

If you want to use matlab's prctile, you would have to find the returned value's index doing something like this:

L = cell(size(T,1),1);
U = cell(size(T,1),1);
for row=1:size(T,1)
    row_values = T{row,:};
    row_values = row_values(2:end); % Remove date column
    non_nan_indeces = find(~isnan(row_values));
    if not(isempty(non_nan_indeces))
        [row_values,sorted_indeces] = sort(row_values(non_nan_indeces));
        L_val = prctile(row_values(non_nan_indeces),20);
        U_val = prctile(row_values(non_nan_indeces),80);
        % The +1 is because we removed the date column
        L_ind = non_nan_indeces(sorted_indeces(find(row_values<=L_val)))+1;
        U_ind = non_nan_indeces(sorted_indeces(find(row_values>=U_val)))+1;
        % I am unsure about this part
        L{row} = T.Properties.VariableNames(L_ind);
        U{row} = T.Properties.VariableNames(U_ind);
    else
        L{row} = nan;
        U{row} = nan;
    end
end;