I took a TBB matrix multiplication example from here.
This example uses the concept of blocked_range for parallel_for loops. I also ran a couple of programs using the Intel MKL and Eigen libraries. When I compare the times taken by these implementations, MKL is the fastest, while TBB is the slowest (10 times slower than Eigen on average) across a variety of matrix sizes (2-4096). Is this normal, or am I doing something wrong? Shouldn't TBB perform at least as well as Eigen?
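The linked code isn't reproduced here, but the TBB tutorial-style multiply is roughly the following shape (a minimal sketch, assuming row-major matrices stored in std::vector; the function name multiply is illustrative, not from the linked source):

```cpp
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Naive O(n^3) multiply, parallelized over rows with tbb::parallel_for.
// C is assumed to be pre-sized to n*n.
void multiply(const std::vector<double>& A,
              const std::vector<double>& B,
              std::vector<double>& C,
              std::size_t n)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n),
        [&](const tbb::blocked_range<std::size_t>& rows) {
            for (std::size_t i = rows.begin(); i != rows.end(); ++i)
                for (std::size_t j = 0; j < n; ++j) {
                    double sum = 0.0;
                    for (std::size_t k = 0; k < n; ++k)
                        sum += A[i * n + k] * B[k * n + j];
                    C[i * n + j] = sum;
                }
        });
}
```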
That looks like a really basic matrix multiplication algorithm, meant as little more than an example of how to use TBB. There are far better algorithms, and I'm fairly certain Intel MKL uses SSE / AVX / FMA instructions too.
To put it another way, there wouldn't be any point to the Intel MKL if you could replicate its performance with 20 lines of code. So yes, what you get seems normal.
At the very least, with large matrices, the algorithm needs to take cache and other details of the memory subsystem into account.
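To illustrate the cache point only (this is not MKL's or Eigen's actual kernel, just the loop-tiling idea; BLOCK, multiply_blocked, and the zero-initialized C are assumptions for the sketch):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tile size chosen so a few BLOCK x BLOCK tiles fit in L1/L2 cache;
// the right value depends on the CPU and element type.
constexpr std::size_t BLOCK = 64;

// Cache-blocked (tiled) serial multiply; C must be zero-initialized.
void multiply_blocked(const std::vector<double>& A,
                      const std::vector<double>& B,
                      std::vector<double>& C,
                      std::size_t n)
{
    for (std::size_t ii = 0; ii < n; ii += BLOCK)
        for (std::size_t kk = 0; kk < n; kk += BLOCK)
            for (std::size_t jj = 0; jj < n; jj += BLOCK)
                // The inner three loops touch only small tiles of A, B and C,
                // so the working set stays cache-resident instead of
                // streaming the whole of B for every row of A.
                for (std::size_t i = ii; i < std::min(ii + BLOCK, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BLOCK, n); ++k) {
                        const double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BLOCK, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

On top of tiling like this, optimized BLAS kernels add vectorization, register blocking, data packing and careful threading, which is where the remaining gap comes from.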