> Ultimately, though, I'd like to see some of the inner loops to take 
> advantage of SSE (and equivalent) instructions if the number of 
> iterations is large-enough.    So, yes, I think we could get faster.  

When we start to seriously look at this, we should consider using liboil
to implement these optimizations.


"""Liboil is a library of simple functions that are optimized for
various CPUs. These functions are generally loops implementing simple
algorithms, such as converting an array of N integers to floating-point
numbers or multiplying and summing an array of N numbers. Such functions
are candidates for significant optimization using various techniques,
especially by using extended instructions provided by modern CPUs
(Altivec, MMX, SSE, etc.)."""
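
To make that concrete, here is a rough sketch of the kind of loop being
described: converting an array of N integers to floats, once as a plain C
loop and once with SSE2. The function names are made up for illustration
and are not liboil's API. The intrinsics are the standard ones from
<emmintrin.h>, and the vector path is only compiled in when the compiler
defines __SSE2__; otherwise the second function is just the scalar loop.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical reference version: one element per iteration. */
static void conv_f32_s32_ref(float *dst, const int32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (float)src[i];
}

/* Hypothetical optimized version: four elements per iteration with SSE2
 * when available, then a scalar loop for whatever is left over. */
static void conv_f32_s32_fast(float *dst, const int32_t *src, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        /* Unaligned load of four 32-bit ints, convert, unaligned store. */
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_ps(dst + i, _mm_cvtepi32_ps(v));
    }
#endif
    for (; i < n; i++)
        dst[i] = (float)src[i];
}
```

Both functions take the same arguments, so a caller can swap one for the
other without changing anything else.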

"""Each function class has one or more function implementations, which
are real functions that perform the exact same action as defined by the
documentation for the function. Each class has one implementation that
is the reference implementation. This reference implementation is used to
test the accuracy of other implementations.

Presumably, the non-reference implementations can perform the action
faster than the reference implementation. Thus, the liboil
initialization code (at runtime) checks each implementation in a class
to determine the fastest implementation. Once this is done, the class's
indirect function pointer points to the optimal implementation. After
this, any calls to the function class (such as oil_tablelookup_u8()
described above) will automatically be routed to the fastest implementation.

Implementations can be disabled either at compile time (e.g., assembly
code for the wrong architecture) or at run time (e.g., implementation
uses unsupported opcodes). This is done automatically. In addition,
implementations may be disabled because they do not produce the same
results as the reference implementation."""
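
For anyone who hasn't seen this pattern before, the selection scheme
described in that excerpt can be sketched in a few lines of C. This is
only my illustration of the idea, not liboil's code: every name here
(sum_s32, sum_s32_init, the candidate table) is hypothetical, the timing
uses clock() from the C library rather than whatever liboil really uses,
and one candidate is deliberately wrong to show an implementation being
rejected for disagreeing with the reference.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef int64_t (*sum_s32_fn)(const int32_t *src, size_t n);

/* Reference implementation: the definition of correct behaviour. */
static int64_t sum_s32_ref(const int32_t *src, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += src[i];
    return s;
}

/* Candidate: four independent accumulators plus a scalar tail. */
static int64_t sum_s32_unrolled(const int32_t *src, size_t n)
{
    int64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += src[i];
        s1 += src[i + 1];
        s2 += src[i + 2];
        s3 += src[i + 3];
    }
    for (; i < n; i++)
        s0 += src[i];
    return s0 + s1 + s2 + s3;
}

/* Deliberately buggy candidate: drops the tail when n % 4 != 0. */
static int64_t sum_s32_broken(const int32_t *src, size_t n)
{
    int64_t s = 0;
    for (size_t i = 0; i + 4 <= n; i += 4)
        s += (int64_t)src[i] + src[i + 1] + src[i + 2] + src[i + 3];
    return s;
}

/* The class's indirect function pointer; starts at the reference. */
static sum_s32_fn sum_s32_impl = sum_s32_ref;

static const sum_s32_fn candidates[] = { sum_s32_unrolled, sum_s32_broken };

/* Crude timing: CPU time for 200 calls over the test buffer. */
static double time_impl(sum_s32_fn fn, const int32_t *buf, size_t n)
{
    volatile int64_t sink = 0;
    clock_t t0 = clock();
    for (int r = 0; r < 200; r++)
        sink += fn(buf, n);
    (void)sink;
    return (double)(clock() - t0);
}

/* Run once at startup: check each candidate against the reference,
 * time the ones that agree, and point sum_s32_impl at the fastest. */
static void sum_s32_init(void)
{
    enum { N = 1021 };              /* not a multiple of 4 on purpose */
    static int32_t buf[N];
    for (size_t i = 0; i < N; i++)
        buf[i] = (int32_t)(i * 7 % 101) - 50;

    int64_t expect = sum_s32_ref(buf, N);
    double best = time_impl(sum_s32_ref, buf, N);
    sum_s32_impl = sum_s32_ref;

    for (size_t c = 0; c < sizeof candidates / sizeof candidates[0]; c++) {
        if (candidates[c](buf, N) != expect)
            continue;               /* disabled: disagrees with reference */
        double t = time_impl(candidates[c], buf, N);
        if (t < best) {
            best = t;
            sum_s32_impl = candidates[c];
        }
    }
}

/* Public entry point: callers never see which implementation runs. */
static int64_t sum_s32(const int32_t *src, size_t n)
{
    return sum_s32_impl(src, n);
}
```

A caller would run sum_s32_init() once and then only ever call sum_s32().
Which of the two correct implementations wins depends on the machine and
compiler, but the broken one can never be selected because it fails the
comparison against the reference before it is timed.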

And the all-important: """Liboil may be modified and distributed in
accordance with a very liberal license commonly referred to as
"Two-Clause BSD"."""

