I'm computing the incremental mean of my input data (which is an array of 6 elements, so I'll end up with 6 means).
This is the code I run every time a new input array is available (obviously I update the number of samples, etc.):
    computing_mean: for (int i = 0; i < 6; i++) {
        temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) / number_of_samples;
        // Possible optimization?
        // temp_mean[i] = temp_mean[i] + divide(input[i] - temp_mean[i], number_of_samples);
    }
Where all the data in the code are arrays or single values of the following type:

    typedef ap_fixed<36,24,AP_RND_CONV,AP_SAT> decimalNumber;
According to my synthesis report, this loop has a latency of 324 cycles and an iteration latency of 54 cycles, caused mainly by the division operation.
Is there any way I can improve the speed of the division? I tried using hls_math and its divide function, but it doesn't seem to work with my data type.
EDIT 1: I'm including the performance profile from inside Vivado HLS. I'll add self-contained, reproducible code in a later edit. As you can see, the majority of the time is spent in SDIV.
Other than trigonometric functions like sin() (FSIN = ~50-170 cycles) and cos() (FCOS = ~50-120 cycles), or things like sqrt() (FSQRT = ~22 cycles), division will always be the most painful. FDIV is 15 cycles; FADD and FMUL are both 5.

There are occasions where you can skip division and do bit-shifting instead, if you're working with integer data and the number you're dividing by is a power of 2, but that's about it.
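For instance, a minimal sketch of the power-of-2 case (the function name here is made up for illustration):

    #include <cstdint>

    // Dividing an unsigned integer by a power of two is exactly
    // equivalent to shifting it right by log2 of the divisor.
    uint32_t divide_by_8(uint32_t x) {
        return x >> 3;  // same result as x / 8
    }

For signed values, note that a right shift rounds toward negative infinity while integer division rounds toward zero, so the two aren't interchangeable there.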
You can look up the approximate CPU cycle cost of any given instruction in tables like this.
FDIV is an example of an expensive one.

That being said, one thing you could try is to compute the division factor in advance, then apply it using multiplication instead:
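A minimal sketch of what that could look like for the loop in the question, assuming number_of_samples only changes between invocations so the reciprocal can be hoisted out of the loop:

    // One division up front, then six multiplications instead of six divisions.
    decimalNumber inv_samples = decimalNumber(1) / number_of_samples;

    computing_mean: for (int i = 0; i < 6; i++) {
        temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) * inv_samples;
    }

One caveat: ap_fixed<36,24> has only 12 fractional bits, so the precomputed reciprocal loses precision as number_of_samples grows; a wider intermediate type for inv_samples may be needed to keep the mean accurate.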
I'm not sure that's saving a whole lot, but if you really do need to shave off cycles, it's worth a shot.