I have bit-bang code that sends about 4 MB of data over SPI lines. It's embedded code for custom hardware running a Linux kernel.
The problem is that it takes a VERY long time (4 hours), most likely because the kernel is busy doing other things. My code is roughly this:
unsigned char data = 0xFF;
int i;

BB_SPI_Init();
SPI_start(); // activates chip select (enable)
for (i = 0; i < 8; i++) {
    // put the current MSB on MOSI
    if (data & 0x80) {
        gpio_set_value(SPI_MOSI, 1);
    } else {
        gpio_set_value(SPI_MOSI, 0);
    }
    // send clock pulse
    gpio_set_value(SPI_CLK, 0);
    gpio_set_value(SPI_CLK, 1);
    data <<= 1; // move the next bit into the MSB position
}
SPI_stop(); // deactivates chip select (disable)
So it is a very simple bit bang, but I noticed that if I use write() to send data to the Linux GPIO handler /sys/class/gpio/gpioXX/value (where XX is any GPIO number), it takes 4 hours.
But if I use fwrite() to send to the same device, it takes 3 hours.
BUT, if I use write() only for the enables (SPI_stop() and SPI_start()) and fwrite() for sending to MOSI and CLK, it only takes 1 hour and 30 minutes.
So, with that as a base, could someone explain to me why this happens? My guess is that it has to do with how threads are handled: in every software cycle the kernel services two threads (fwrite() and write()) instead of one, as it would if only one of the functions were used. I'm still investigating. Can someone point me to any information? Is there a better way to handle this?
FYI, I can't use the kernel SPI driver because the hardware is connected to GPIOs and bit banging is a mandatory requirement, but I accept any suggestion.
Thanks in advance
EDIT
Hey guys, thanks for your comments. It turns out I had a (very dumb) problem: I was creating the file descriptor every time I sent data to /sys/class/gpio/gpioXX/value, and that's why it was slow. I also turned off some other programs, and the transfer time dropped to 3 minutes instead of 1 hour 30 minutes (with write()). Thanks, and sorry about that.
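For anyone hitting the same thing, here is a minimal sketch of the fix described in the edit: open each value file once and reuse the descriptor for every bit. The helper names (gpio_open, gpio_write, spi_send_byte) are made up for illustration, and it assumes the GPIOs are already exported and set as outputs.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a GPIO value file ONCE, before the transfer.
 * value_path would be something like /sys/class/gpio/gpioXX/value. */
int gpio_open(const char *value_path)
{
    return open(value_path, O_WRONLY);
}

/* Write one level to an already-open descriptor; returns 0 on success. */
int gpio_write(int fd, int level)
{
    const char c = level ? '1' : '0';
    return write(fd, &c, 1) == 1 ? 0 : -1;
}

/* Shift one byte out MSB first. mosi_fd and clk_fd stay open across
 * calls, so each bit costs three write() calls and no open()/close(). */
int spi_send_byte(int mosi_fd, int clk_fd, unsigned char data)
{
    for (int i = 0; i < 8; i++) {
        if (gpio_write(mosi_fd, data & 0x80) ||
            gpio_write(clk_fd, 0) ||
            gpio_write(clk_fd, 1))
            return -1;
        data <<= 1;
    }
    return 0;
}
```

The idea is to call gpio_open() for MOSI and CLK once at startup, loop spi_send_byte() over the whole buffer, and close() the descriptors only when the transfer is done.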
 
                        
I think the spi-bitbang driver is the best solution if you are looking for performance. Bit-banging from user space is painful because you need at least 3 system calls for each bit of data (one write for MOSI and two for the clock edges), and a system call is an expensive operation.
That's why the spi-bitbang driver exists. You can easily configure the spi-bitbang driver to work with your GPIOs.
Then, once you have a spi-bitbang driver, you can write a char device that accepts your entire block of data as input and transfers it in kernel space. With this solution you will get the maximum performance for a bit-bang interface.
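To put a rough number on that cost, here is a small sketch that counts the system calls for a whole transfer. It assumes "4 megs" means 4 MiB and takes the 3 calls per bit from above; the function name bitbang_syscalls is just for illustration.

```c
#include <stdint.h>

/* Number of system calls needed to bit-bang payload_bytes from user
 * space when every bit costs syscalls_per_bit calls. */
uint64_t bitbang_syscalls(uint64_t payload_bytes, unsigned syscalls_per_bit)
{
    return payload_bytes * 8u * syscalls_per_bit;
}
```

With these assumptions, bitbang_syscalls(4ull * 1024 * 1024, 3) gives 100663296, so roughly 100 million system calls for one transfer, before counting chip select or any open()/close() calls.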
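As a rough illustration of handing over a whole block in one call, here is a sketch that uses the stock spidev char device instead of a custom one. This is my assumption, not something from the question: it only works if a spidev node (something like /dev/spidev0.0) has been declared on the bit-banged bus, and the function name spi_send_block is made up.

```c
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/spi/spidev.h>

/* Send len bytes with a single ioctl; returns 0 on success, -1 on error.
 * dev_path is an assumption, it depends on how the bus is registered. */
int spi_send_block(const char *dev_path, const uint8_t *buf,
                   uint32_t len, uint32_t speed_hz)
{
    int fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return -1;

    struct spi_ioc_transfer xfer;
    memset(&xfer, 0, sizeof xfer);
    xfer.tx_buf = (unsigned long)buf; /* user buffer holding the data */
    xfer.len = len;                   /* bytes to shift out */
    xfer.speed_hz = speed_hz;         /* requested clock rate */
    xfer.bits_per_word = 8;

    /* One system call; the kernel toggles the GPIOs for every bit. */
    int rc = ioctl(fd, SPI_IOC_MESSAGE(1), &xfer);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

Note that spidev limits the size of a single transfer (4096 bytes by default, as far as I know), so a 4 MB payload would have to be sent in chunks. Even so, that is one system call per chunk instead of three per bit.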