[Numpy-discussion] searchsorted() and memory cache

Bruce Southey bsouthey@gmail....
Fri May 9 08:36:31 CDT 2008

I don't know if it helps, but Ulrich Drepper had a 9 part series about 
memory in Linux Weekly News (http://lwn.net). You can under all 9 linked 
under his name in the Guest archives 
(http://lwn.net/Archives/GuestIndex/)  as not all are linked together. 
The first one is:

What every programmer should know about memory, Part 1 
<http://lwn.net/Articles/250967/> (September 21, 2007)

In part 5 there was a comment on 'Cache-oblivious algorithms' by *akapoor*:

"i guess it's worth mentioning harald-prokop's 1999 thesis on "cache oblivious algorithms"


Francesc Alted wrote:
> A Friday 09 May 2008, Andrew Straw escrigué:
>> I've got a big element array (25 million int64s) that searchsorted()
>> takes a long time to grind through. After a bit of digging in the
>> literature and the numpy source code, I believe that searchsorted()
>> is implementing a classic binary search, which is pretty bad in terms
>> of cache misses. There are several modern implementations of binary
>> search which arrange items in memory such that cache misses are much
>> more rare. Clearly making such an indexing arrangement would take
>> time, but in my particular case, I can spare the time to create an
>> index if searching was faster, since I'd make the index once but do
>> the searching many times.
>> Is there an implementation of such an algorithm that works easilty
>> with numpy? Also, can you offer any advice, suggestions, and comments
>> to me if I attempted to implement such an algorithm?
> Well, if you can afford extra space for the hashes you can always use a 
> dictionary for doing the lookups.  In pure Python they are around 3x 
> faster (for arrays of 8 millions of elements) than binary searches.  If 
> your space is tight, you can build an extension (for example in Pyrex) 
> for doing binary search for your specific type (int64), for an small 
> improvement.  Finally, if you combine this approach with what is 
> suggesting Charles Harris (i.e. creating several levels of caches, but 
> not more than two, which in my experience works best), you can have 
> pretty optimal lookups with relatively low space overhead.
> See this thread for a discussion of the binary/hash lookup approaches:
> http://mail.python.org/pipermail/python-list/2007-November/465900.html
> Hope that helps,

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080509/220f0cc7/attachment.html 

More information about the Numpy-discussion mailing list