Partially fixed. I was mixing up the row and column order. For some reason it still worked in some cases. Now I've fixed it and it *always* works. However, it is still slower than cblas: cblas -> 0.69 sec, scipy blas -> 0.74 sec. Any clue why?
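One thing that might be worth checking is memory layout. The scipy BLAS wrappers are generated with f2py and work on Fortran-ordered (column-major) arrays, so C-ordered inputs may be copied before the call, which could account for part of a small gap like this. Below is a minimal sketch for comparing the two layouts. It is not the code from this thread: the matrix size `n` and the helper `time_dgemm` are made up for illustration, and the timings will depend on the BLAS build.

```python
import time

import numpy as np
from scipy.linalg.blas import dgemm


def time_dgemm(a, b, repeats=3):
    """Hypothetical helper: best wall time (seconds) and result of dgemm(1.0, a, b)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        c = dgemm(1.0, a, b)  # computes 1.0 * a @ b
        best = min(best, time.perf_counter() - start)
    return best, c


n = 200  # assumed size, only for illustration
rng = np.random.default_rng(0)

# NumPy creates C-ordered (row-major) arrays by default.
a_c = rng.standard_normal((n, n))
b_c = rng.standard_normal((n, n))

# Same values, stored column-major as the Fortran BLAS interface expects.
a_f = np.asfortranarray(a_c)
b_f = np.asfortranarray(b_c)

t_c, c_from_c = time_dgemm(a_c, b_c)  # wrapper may copy the inputs first
t_f, c_from_f = time_dgemm(a_f, b_f)  # inputs already in Fortran order

# Alternative for C-ordered data: a.T and b.T are Fortran-contiguous views,
# and (b.T @ a.T) is the transpose of (a @ b), so no layout copy is needed.
c_t = dgemm(1.0, b_c.T, a_c.T)

print(f"C-ordered inputs:       {t_c:.6f} sec")
print(f"Fortran-ordered inputs: {t_f:.6f} sec")
```

If the Fortran-ordered call is noticeably faster, the remaining difference is probably copy overhead rather than the BLAS routine itself. The result of `dgemm` comes back Fortran-ordered, so `c_t.T` above is a C-contiguous view holding `a_c @ b_c`.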