[SciPy-dev] Should ndimage.measurements.* should return lists if index is a list?

Thouis (Ray) Jones thouis@broad.mit....
Fri May 15 09:46:24 CDT 2009


I was going to rewrite ndimage.histogram() to be faster, but I think
it's worth leaving as is to track numpy.histogram() which has been in
flux recently.

I filed a scipy ticket: 945

Ray Jones


On Fri, May 8, 2009 at 14:43, Thouis (Ray) Jones <thouis@broad.mit.edu> wrote:
> I've just pushed a new version to
> http://broad.mit.edu/~thouis/scipy.git ndimage_measurements_rewrite
>
> This includes rewrites of:  sum, mean, variance, standard_deviation,
> minimum, maximum, minimum_position, maximum_position, center_of_mass,
> and histogram in pure python.  It also adds a labeled_comprehension()
> function.
>
> The histogram() code could be much improved, speed-wise.  It uses
> labeled_comprehension and numpy.histogram, but could use similar
> bincount tricks as have been done for sum et al.
>
> Some timing comparisons below, with test script attached.
>
> Ray
>
>
> Timing information (1K x 1K image), secs per call.
> New:
> variance 0.0620911860466
> minimum 0.48847284317
> maximum_position 0.588409199715
> extrema 0.813310542107
> center_of_mass 0.220837605
> histogram 5.70109256506
>
> Old:
> variance 0.0640315508842
> minimum 0.0331222701073
> maximum_position 1.07506107569
> extrema 2.27282817364
> center_of_mass 0.0554035758972
> histogram 0.0865430688858
>
>
>
>
> On Tue, May 5, 2009 at 14:54,  <josef.pktd@gmail.com> wrote:
>> On Mon, May 4, 2009 at 5:11 PM, Thouis (Ray) Jones <thouis@broad.mit.edu> wrote:
>>> On Mon, May 4, 2009 at 16:43,  <josef.pktd@gmail.com> wrote:
>>>> The use case, I wrote my application for, also allowed for labels that
>>>> are strings. The return inverse of unique1d is very flexible since it
>>>> can handle many different types including structured arrays, but for
>>>> numeric labels, I just found out, that it is a large speed penalty. If
>>>> the labels are already integers or can be easily converted to
>>>> integers, then not using unique1d to create the integer labels should
>>>> be much faster.
>>>
>>> Using bincount probably requires calling unique1d (or some equivalent)
>>> for this usage, since otherwise the call to bincount could create a
>>> very large output (2^32 or even 2^64 bins).
>>>
>>> Ray Jones
>>
>> In my version, I started to use a conditional call to unique1d, if the
>> type of the labels is not int (to handle float and strings). This
>> could be extended to a very large max(labels).  If the label type is
>> int then, I just use bincount without unique1d.
>>
>> I tried it with 50% unused numbers, and the speed of bincount is
>> comparable to ndimage, labels are generated with
>> factors = (100+2*np.random.randint(nfact,size=nobs))
>> #.astype('S6')#.astype(float)
>> I tried with up to 4 million observations and 10000 unique labels.
>>
>> I'm starting to get confused with the many different versions, that I
>> have now for this, the two typical examples are below. ndimage looks
>> very picky which arrays it accepts, every once in a while I get a not
>> implemented type exception or even a segfault.
>>
>> I also attach my full timing and test script.
>>
>> Josef
>>
>>
>> def labelmean_ndi(factors, values, index=None):
>>    # works also for string labels in ys, but requires 1D
>>    # from mailing list scipy-user 2009-02-11
>>    unilinv = np.asanyarray(factors)
>>
>>    if not np.issubdtype(unilinv.dtype, int):
>>        unil, unilinv = np.unique1d(factors, return_inverse=1)
>>        if index is None:
>>            index = np.arange(len(unil)) #unil
>>    elif index is None:
>>        index = np.arange(np.max(unilinv)+ 1)
>>
>>    labelmeans = np.array(ndimage.mean(values, labels=unilinv, index=index))
>>    return labelmeans
>>
>> def groupstatsbin(factors, values, ddof=0, stat='mvnx', drop=True):
>>    '''uses np.bincount, assumes factors/labels are integers,
>>    create labels with unique1d return_inverse if string labels
>>    '''
>>    #n = len(factors)
>>
>>    rind = np.asanyarray(factors)
>>    #note: bincount uses complete list of integers, maybe range(max(factor)+1)
>>    if not np.issubdtype(rind.dtype, int):
>>        ix, rind = np.unique1d(factors, return_inverse=1)
>> ##    else:
>> ##        rind = factors
>>    #print rind.shape
>>    res = []
>>    if 'c' in stat or 'm' in stat or 'v' in stat:
>>        gcount = np.bincount(rind)
>>        if drop:
>>            keepidx = np.nonzero(gcount)
>>        if 'c' in stat:
>>            if drop:
>>                res.append(gcount[keepidx])
>>            else:
>>                res.append(gcount)
>>
>>    if 'm' in stat or 'v' in stat:
>>        gmean = np.bincount(rind, weights=values)/ (1.0*gcount)
>>        if np.max(np.abs(gmean)) > 1e6:
>>            # improve numerical precision if means are large
>>            # (to "cheat" on NIST anova test examples)
>>            meanarr = gmean[rind]
>>            gmean += np.bincount(rind, weights=(values-meanarr)) / (1.0*gcount)
>>        if 'm' in stat:
>>            if drop:
>>                res.append(gmean[keepidx])
>>            else:
>>                res.append(gmean)
>>
>>    if 'v' in stat:
>>        meanarr = gmean[rind]
>>        withinvar = np.bincount(rind, weights=(values-meanarr)**2) /
>> (1.0*gcount-ddof)
>>        if drop:
>>            res.append(withinvar[keepidx])
>>        else:
>>            res.append(withinvar)
>>
>>    if 'n' in stat or 'x' in stat:
>>        #calculate min, max per factor
>>        sortind = np.lexsort((values, rind))
>>        fs = rind[sortind]
>>        vs = values[sortind]
>>        fsd = np.hstack((np.inf,np.diff(fs),np.inf))
>>    if 'n' in stat:
>>        #minimum
>>        res.append(vs[fsd[:-1] != 0])
>>    if 'x' in stat:
>>        #maximum
>>        res.append(vs[fsd[1:] != 0])
>>    return res
>>
>> _______________________________________________
>> Scipy-dev mailing list
>> Scipy-dev@scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>
>


More information about the Scipy-dev mailing list