[SciPy-dev] Should ndimage.measurements.* should return lists if index is a list?

Thouis (Ray) Jones thouis@broad.mit....
Fri May 8 13:43:15 CDT 2009


I've just pushed a new version to
http://broad.mit.edu/~thouis/scipy.git ndimage_measurements_rewrite

This includes rewrites of:  sum, mean, variance, standard_deviation,
minimum, maximum, minimum_position, maximum_position, center_of_mass,
and histogram in pure python.  It also adds a labeled_comprehension()
function.

The histogram() code could be much improved, speed-wise.  It uses
labeled_comprehension and numpy.histogram, but could use similar
bincount tricks as have been done for sum et al.

Some timing comparisons below, with test script attached.

Ray


Timing information (1K x 1K image), secs per call.
New:
variance 0.0620911860466
minimum 0.48847284317
maximum_position 0.588409199715
extrema 0.813310542107
center_of_mass 0.220837605
histogram 5.70109256506

Old:
variance 0.0640315508842
minimum 0.0331222701073
maximum_position 1.07506107569
extrema 2.27282817364
center_of_mass 0.0554035758972
histogram 0.0865430688858




On Tue, May 5, 2009 at 14:54,  <josef.pktd@gmail.com> wrote:
> On Mon, May 4, 2009 at 5:11 PM, Thouis (Ray) Jones <thouis@broad.mit.edu> wrote:
>> On Mon, May 4, 2009 at 16:43,  <josef.pktd@gmail.com> wrote:
>>> The use case, I wrote my application for, also allowed for labels that
>>> are strings. The return inverse of unique1d is very flexible since it
>>> can handle many different types including structured arrays, but for
>>> numeric labels, I just found out, that it is a large speed penalty. If
>>> the labels are already integers or can be easily converted to
>>> integers, then not using unique1d to create the integer labels should
>>> be much faster.
>>
>> Using bincount probably requires calling unique1d (or some equivalent)
>> for this usage, since otherwise the call to bincount could create a
>> very large output (2^32 or even 2^64 bins).
>>
>> Ray Jones
>
> In my version, I started to use a conditional call to unique1d, if the
> type of the labels is not int (to handle float and strings). This
> could be extended to a very large max(labels).  If the label type is
> int then, I just use bincount without unique1d.
>
> I tried it with 50% unused numbers, and the speed of bincount is
> comparable to ndimage, labels are generated with
> factors = (100+2*np.random.randint(nfact,size=nobs))
> #.astype('S6')#.astype(float)
> I tried with up to 4 million observations and 10000 unique labels.
>
> I'm starting to get confused with the many different versions, that I
> have now for this, the two typical examples are below. ndimage looks
> very picky which arrays it accepts, every once in a while I get a not
> implemented type exception or even a segfault.
>
> I also attach my full timing and test script.
>
> Josef
>
>
> def labelmean_ndi(factors, values, index=None):
>    # works also for string labels in ys, but requires 1D
>    # from mailing list scipy-user 2009-02-11
>    unilinv = np.asanyarray(factors)
>
>    if not np.issubdtype(unilinv.dtype, int):
>        unil, unilinv = np.unique1d(factors, return_inverse=1)
>        if index is None:
>            index = np.arange(len(unil)) #unil
>    elif index is None:
>        index = np.arange(np.max(unilinv)+ 1)
>
>    labelmeans = np.array(ndimage.mean(values, labels=unilinv, index=index))
>    return labelmeans
>
> def groupstatsbin(factors, values, ddof=0, stat='mvnx', drop=True):
>    '''uses np.bincount, assumes factors/labels are integers,
>    create labels with unique1d return_inverse if string labels
>    '''
>    #n = len(factors)
>
>    rind = np.asanyarray(factors)
>    #note: bincount uses complete list of integers, maybe range(max(factor)+1)
>    if not np.issubdtype(rind.dtype, int):
>        ix, rind = np.unique1d(factors, return_inverse=1)
> ##    else:
> ##        rind = factors
>    #print rind.shape
>    res = []
>    if 'c' in stat or 'm' in stat or 'v' in stat:
>        gcount = np.bincount(rind)
>        if drop:
>            keepidx = np.nonzero(gcount)
>        if 'c' in stat:
>            if drop:
>                res.append(gcount[keepidx])
>            else:
>                res.append(gcount)
>
>    if 'm' in stat or 'v' in stat:
>        gmean = np.bincount(rind, weights=values)/ (1.0*gcount)
>        if np.max(np.abs(gmean)) > 1e6:
>            # improve numerical precision if means are large
>            # (to "cheat" on NIST anova test examples)
>            meanarr = gmean[rind]
>            gmean += np.bincount(rind, weights=(values-meanarr)) / (1.0*gcount)
>        if 'm' in stat:
>            if drop:
>                res.append(gmean[keepidx])
>            else:
>                res.append(gmean)
>
>    if 'v' in stat:
>        meanarr = gmean[rind]
>        withinvar = np.bincount(rind, weights=(values-meanarr)**2) /
> (1.0*gcount-ddof)
>        if drop:
>            res.append(withinvar[keepidx])
>        else:
>            res.append(withinvar)
>
>    if 'n' in stat or 'x' in stat:
>        #calculate min, max per factor
>        sortind = np.lexsort((values, rind))
>        fs = rind[sortind]
>        vs = values[sortind]
>        fsd = np.hstack((np.inf,np.diff(fs),np.inf))
>    if 'n' in stat:
>        #minimum
>        res.append(vs[fsd[:-1] != 0])
>    if 'x' in stat:
>        #maximum
>        res.append(vs[fsd[1:] != 0])
>    return res
>
> _______________________________________________
> Scipy-dev mailing list
> Scipy-dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ndimage_test.py
Type: application/octet-stream
Size: 1123 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/scipy-dev/attachments/20090508/ee6b6398/attachment.obj 


More information about the Scipy-dev mailing list