[SciPy-Dev] Enhancements to scipy.spatial.cKDTree
Sun Jul 15 19:00:35 CDT 2012
Wow, I must confess I wasn't quite expecting the Spanish Inquisition :-D
The original reason I got into this project was to bring a fast
query_ball_point to cKDTree (which I really needed, the KDTree
implementation is so slow as to be unusable for analysing a long molecular
dynamics trajectory). When I started, I thought I was only porting over
one method of KDTree, so I strove to keep the code changes to a minimum
(hence not abstracting out the incremental updates to the minimum/maximum
distances between the query point and a node's bounding hyperrectangle) and
to keep a very similar style to the existing code (hence the int's,
double's, malloc/free, etc.). I only went about porting over the rest of
the methods essentially to submit a complete enhancement to SciPy instead
of just the juicy bit I cared about (after parsing Anne's code, I figured
90% of the effort was done already). I also agree that Anne's code is
really transparent, and should stay no matter what, but speed does matter a
fair bit to users of kd-trees. And in my defense with respect to the code
being "full of bugs", I tried hard to make all the unit tests pass on my
machine before submitting and was completely unaware of the nightmares of
32 vs 64-bit ints in Windows 64.
That said, now that the initial effort is complete, it's obvious that a
clear pattern recurs in the kd-tree walking algorithms: keeping incremental
track of the minimum/maximum distances between two hyperrectangles as
they're successively split (alternatively, a hyperrectangle and a point).
If that primitive is abstracted, the code becomes a lot more readable
again. I've had a first go at what such an abstraction might look like,
and updated the code to query_ball_tree. Does this look like a way forward
for enhancing the code's maintainability?
Here are another couple of issues / questions:
* Is there any difference between <np.float64_t*>np.PyArray_DATA(my_array)
and &my_array ? If not, the second choice is clearly preferable.
* I get the issues with int vs np.int vs np.npy_intp, but is there really
any SciPy-supported platform in which double doesn't mean np.float64_t?
* As for memory management, it's probably easier to let NumPy & Python
handle most things, i.e. make Rectangle store two NumPy arrays. This needs
a bit of surgery to do right, but I've tried to show what I mean in the
query_ball_tree version I'm submitting now.
* I fixed a small glitch with calls to realloc(): Sturla's changes for
handling memory errors ended up setting the first argument to realloc to
NULL, which effectively wipes out the contents of the data that were there
before the resize.
* There seems to be something funny about raw_mins and raw_maxes: they are
member variables of cKDTree, but the data they point to seems to me like it
should go out of scope at the end of cKDTree.__init__. Am I
misunderstanding something here?
All the best,
On Sun, Jul 15, 2012 at 12:29 AM, Sturla Molden <email@example.com> wrote:
> Den 15.07.2012 00:36, skrev Sturla Molden:
> Thanks. On Windows 64, query_pair returns bad results depending on the
> size of the tree. I don't know why but Patrick's code hade the same problem.
> Which brings me to the other issue I'd like to discuss...
> In its current state, Patrick's enhancement to cKDTree (IMHO) is close to
> unmaintainable. It might be 'fast', but it is not estetically nice looking
> code (no pun intended), about 2500 lines of low-level C masquerading as
> Cython, very difficult to read and follow with respect to logic, it was
> full of bugs (and might still be), and still don't pass all the tests on my
> computer (even though I have been debugging it quite extensively).
> Who is going to maintain it? Patrick? Ralph? Anne? (I am not
> On the other hand, Anne's KDTree.py is an example of very beautiful Python
> code, and very easy to read and understand. KDTree is also robust and
> rather bug-free, after having been tested for quite a long time, and more
> or less feature-complete for a KDTree. But it is a bit slow, courtesy to
> How fast do we need it to be?
> I think it might be better to just gently 'Cythonize' out the worst
> bottlenecks in KDTree.py, but leave the rest as it is. It might not be
> equally fast as Patrick's rewrite of cKDTree enhancement, but will be 100
> times easier to maintain, and probably "fast enough" for most usecases.
> cKDTree was intended as a very fast kd-tree for nearest-neighbor queries.
> It was never intended as general-purpose kd-tree like, well, KDTree.
> Currently in SciPy master it is very "lean and mean" for its particular
> purpose. (It could still use some minor updating, such as 64-bit support.)
> I think at least this should be discussed before this major rewrite is
> pulled into SciPy master. I am not saying Patrick's rewrite is bad anbs
> should not be pulled into SciPy. But I don't think we should drop code into
> SciPy if it will never be properly maintained. It must be maintainable for
> years ahead.
> SciPy-Dev mailing list
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the SciPy-Dev