[Numpy-discussion] searchsorted() and memory cache

Francesc Alted falted@pytables....
Fri May 9 04:09:19 CDT 2008


A Friday 09 May 2008, Andrew Straw escrigué:
> I've got a big element array (25 million int64s) that searchsorted()
> takes a long time to grind through. After a bit of digging in the
> literature and the numpy source code, I believe that searchsorted()
> is implementing a classic binary search, which is pretty bad in terms
> of cache misses. There are several modern implementations of binary
> search which arrange items in memory such that cache misses are much
> more rare. Clearly making such an indexing arrangement would take
> time, but in my particular case, I can spare the time to create an
> index if searching was faster, since I'd make the index once but do
> the searching many times.
>
> Is there an implementation of such an algorithm that works easilty
> with numpy? Also, can you offer any advice, suggestions, and comments
> to me if I attempted to implement such an algorithm?

Well, if you can afford extra space for the hashes you can always use a 
dictionary for doing the lookups.  In pure Python they are around 3x 
faster (for arrays of 8 millions of elements) than binary searches.  If 
your space is tight, you can build an extension (for example in Pyrex) 
for doing binary search for your specific type (int64), for an small 
improvement.  Finally, if you combine this approach with what is 
suggesting Charles Harris (i.e. creating several levels of caches, but 
not more than two, which in my experience works best), you can have 
pretty optimal lookups with relatively low space overhead.

See this thread for a discussion of the binary/hash lookup approaches:

http://mail.python.org/pipermail/python-list/2007-November/465900.html

Hope that helps,

-- 
Francesc Alted


More information about the Numpy-discussion mailing list