I had to write a project to pass my course, and I would like to ask whether there is any way to make my code more efficient or simply better. I'm asking because my coordinator is a very meticulous perfectionist who is crazy about efficiency. It's a hybrid program (C plus assembly) that modifies a 24bpp bitmap. It performs contrast reduction; the algorithm looks like this (it's approved by my coordinator):
comp -= 128;
comp *= rfactor;
comp /= 128;
comp += 128;
'comp' stands for each component of a pixel, that is, every red, green and blue value in every pixel. The function does only this; the file is read by other functions written in C. I pass to the assembly routine an array of components, the width of the bitmap, the number of pixels in each line, and 'rfactor', the contrast reduction value. Then I just do this:
; void contrast(void *img, int width, int lineWidth, int rfactor);
; stack: EBP+8  -> *img
;        EBP+12 -> width [px]
;        EBP+16 -> lineWidth [B]
;        EBP+20 -> rfactor (values in range of 1-128)
section .text
global contrast
contrast:
    push ebp
    mov ebp, esp
    push ebx
    mov ebx, [ebp+12]       ; width
    mov eax, [ebp+16]       ; lineWidth
    mul ebx                 ; eax = width * lineWidth = loop count (clobbers edx)
    mov ecx, eax            ; set counter
    mov edx, [ebp+8]        ; edx = pointer to img
    mov ebx, [ebp+20]       ; ebx = rfactor
loop:
    xor eax, eax
    dec ecx                 ; decrement counter
    mov al, [edx]           ; load current component into al
    add eax, -128
    imul bl                 ; ax = al * bl (signed 8-bit multiply): component * rfactor
    sar eax, 7              ; divide by 128 (arithmetic shift)
    add eax, 128
    mov byte[edx], al       ; store the component back
    inc edx                 ; next component
    test ecx, ecx           ; is counter 0?
    jnz loop
koniec:
    pop ebx
    mov esp, ebp
    pop ebp
    ret
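For checking the assembly output, the per-component arithmetic can be sketched in plain C. This is only a reference sketch, not part of the original program: the names `floor_div128`, `reduce_component` and `contrast_ref` are made up for illustration, and it assumes `rfactor` is in the stated 1-128 range. It divides with rounding toward minus infinity, because that is what `sar eax, 7` does for negative values, whereas a plain C `/ 128` would truncate toward zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Floor division by 128, matching `sar eax, 7` for negative inputs
   (plain C `/` would truncate toward zero instead). */
static int floor_div128(int v)
{
    return (v >= 0) ? v / 128 : -((-v + 127) / 128);
}

/* Reference for one colour component; rfactor is assumed to be in 1..128. */
static uint8_t reduce_component(uint8_t comp, int rfactor)
{
    assert(rfactor >= 1 && rfactor <= 128);
    int v = (int)comp - 128;          /* comp -= 128     */
    v = floor_div128(v * rfactor);    /* comp *= rfactor; comp /= 128 */
    return (uint8_t)(v + 128);        /* comp += 128     */
}

/* Hypothetical helper: applies the reference to a whole byte buffer. */
static void contrast_ref(uint8_t *buf, size_t nbytes, int rfactor)
{
    for (size_t i = 0; i < nbytes; ++i)
        buf[i] = reduce_component(buf[i], rfactor);
}
```

Running `contrast_ref` and the assembly routine on copies of the same buffer and comparing them byte by byte would show any mismatch. One case that may be worth checking this way is `rfactor` = 128, since that value does not fit in a signed byte and `imul bl` treats `bl` as signed.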
Is there anything to improve? Thank you for all suggestions, I have to impress my coordinator ;)
If you are still interested in a SIMD version, here is one.
It uses AVX2 instructions, so you need at least a 4th-generation Intel Core processor (Haswell microarchitecture).
I have tested it by comparing its output with the output of your code.
The prototype in C is your old one (with lineWidth):
I did some profiling on my machine: I ran this version and the one in your answer on a 2048x20480 image (120 MiB buffer) 10 times. Your code takes 2.93 seconds, this one 1.09 seconds, though these timings may not be very accurate.
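If you want to repeat this kind of measurement yourself, a rough harness could look like the sketch below. It is not the code I used for the numbers above; `time_contrast` and `contrast_stub` are hypothetical names, the stub only stands in for whichever routine you link in, and `clock()` measures CPU time with limited resolution, so treat the result as a rough figure.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Same shape as the C prototype of the assembly routine. */
typedef void (*contrast_fn)(void *img, int width, int lineWidth, int rfactor);

/* Placeholder standing in for the real routine; it only counts calls. */
static int stub_calls = 0;
static void contrast_stub(void *img, int width, int lineWidth, int rfactor)
{
    (void)img; (void)width; (void)lineWidth; (void)rfactor;
    ++stub_calls;
}

/* Hypothetical helper: CPU seconds spent in `runs` calls of fn
   on the same buffer. */
static double time_contrast(contrast_fn fn, void *img, int width,
                            int lineWidth, int rfactor, int runs)
{
    assert(fn != NULL && runs > 0);
    clock_t start = clock();
    for (int i = 0; i < runs; ++i)
        fn(img, width, lineWidth, rfactor);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Passing the real `contrast` symbol instead of `contrast_stub`, with `runs` set to 10, would give a figure comparable in spirit to the ones quoted above.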
This version requires a buffer size that is a multiple of 16 (because it processes 16 bytes per iteration, that is, five and one third pixels at a time); you can pad with zeros. If the buffer is aligned on a 16-byte boundary it will run faster.
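On the C side, one way to satisfy both conditions is to copy the pixel data into a padded, aligned buffer before calling the routine. The following is a minimal sketch under assumptions: `round_up_16` and `make_padded_copy` are hypothetical helper names, and it relies on C11 `aligned_alloc`, which your toolchain may or may not provide.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of 16. */
static size_t round_up_16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Hypothetical helper: copies nbytes of pixel data into a newly allocated,
   16-byte-aligned buffer whose size is a multiple of 16, with the tail
   filled with zeros. The caller releases it with free().
   Returns NULL if the allocation fails. */
static uint8_t *make_padded_copy(const uint8_t *src, size_t nbytes,
                                 size_t *padded_out)
{
    assert(src != NULL && padded_out != NULL);
    size_t padded = round_up_16(nbytes);
    if (padded == 0)
        padded = 16;                 /* avoid a zero-size allocation */
    uint8_t *dst = aligned_alloc(16, padded);
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, nbytes);                    /* real pixel data */
    memset(dst + nbytes, 0, padded - nbytes);    /* zero padding    */
    *padded_out = padded;
    return dst;
}
```

After the routine has run, only the first `nbytes` bytes need to be copied back; the padding bytes can be ignored.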
If you want a more detailed answer (with useful comments for example :D) just ask in the comments.
EDIT: Updated the code with the great help of Peter Cordes, for future reference.