[Numpy-discussion] diff of find of diff trick

Norbert Nemec Norbert.Nemec.list@gmx...
Thu Oct 11 10:34:41 CDT 2007


Basically, what you want to do is a histogram. Numpy has that
functionality built in. However: the version built in to numpy is about
as suboptimal as yours. The problem is the unnecessary sorting of the data.

In principle, a histogram does not need any sorting, so it could be done
in strictly O(N). I don't see any simply way of doing so efficiently in
numpy without resorting to a costly loop. We had a discussion about this
a while ago but I cannot remember any resolution. The proper thing to do
would probably be a C routine.

If performance is crucial to you, I would suggest using weave. A minimal
example as starting point is given below.

Greetings,
Norbert

---------------------------
#!/usr/bin/env python

import numpy
import scipy.weave
import scipy.weave.converters

def histogram(data, nR):
    res = numpy.zeros(nR,int)
    dataflat = data.flatten()
    Ndata = len(dataflat)
    scipy.weave.inline("""
        for(int i=0;i<Ndata;i++) {
            int d = dataflat(i);
            if(d < nR && d >= 0) {
                res(d) += 1;
            }
        }
        """,
        arg_names = ["dataflat","res","Ndata","nR"],
        type_converters = scipy.weave.converters.blitz,
    )
    return res



a = numpy.array([0,3,1,2,3,5,2,3,2,4,5,1,3,4,2])
print a
print histogram(a,4)
---------------------------





Robin wrote:
> Hello,
>
> As a converting MATLAB user I am bringing over some of my code. A core
> function I need is to sample the probability distribution of a given
> set of data. Sometimes these sets are large so I would like to make it
> as efficient as possible. (the data are integers representing members
> of a discrete space)
>
> In MATLAB the best way I found was the "diff of find of diff" trick
> which resulted in the completely vectorised solution (below). Does it
> make sense to translate this into numpy? I don't have a feel yet for
> what is fast/slow - are the functions below built in and so quick (I
> thought diff might not be).
>
> Is there a better/more pythonic way to do it?
>
> --------
> function Pr=prob(data, nR)
>
> Pr    = zeros(nR,1);
> % diff of find of diff trick for counting number of elements
> temp     = sort(data(data>0));    % want vector excluding P(0)
> dtemp    = diff([temp;max(temp)+1]);
> count     = diff(find([1;dtemp]));
> indx     = temp(dtemp>0);
> Pr(indx)= count ./ numel(data);    % probability
> --------
>
> Thanks
>
> Robin
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>   



More information about the Numpy-discussion mailing list