[Numpy-discussion] No Copy Reduce Operations

Luis Pedro Coelho lpc@cmu....
Sat Jul 26 14:03:35 CDT 2008


Hello all,

Numpy arrays come with several reduce operations: sum(), std(), argmin(), 
min(), ....

The traditional implementation of these suffers from two big problems: It is 
slow and it often allocates intermediate memory. I have code that is failing 
with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly 
handle arrays with 100 million entries (have a couple of million objects * 20 
features per object = 100 million doubles), so this is a real problem for me.
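To illustrate the memory problem (this is my own sketch, not the attached code): a chunked two-pass std keeps peak extra memory proportional to the chunk size rather than the array size, which is what matters at 100 million entries.

```python
import numpy as np

def chunked_std(A, chunk=65536):
    """Two-pass std over an array using O(chunk) extra memory,
    instead of the array-sized temporaries ndarray.std() allocates."""
    A = A.ravel()
    mu = A.mean()                      # first pass over the data
    acc = 0.0
    for start in range(0, A.size, chunk):
        d = A[start:start + chunk] - mu   # only a chunk-sized temporary
        acc += np.dot(d, d)               # sum of squared deviations
    return np.sqrt(acc / A.size)

A = np.random.rand(100000)
```

This matches A.std() (default ddof=0) while never materialising a full-size difference array.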

This being open-source, I decided to solve the problem. My first idea was to 
try to improve the numpy code. I failed to see how to do that while 
supporting everything that numpy does (multiple types, for example), so I 
started an implementation of reduce operations that uses C++ templates to 
generate code optimised for the types it actually uses, choosing the right 
version to call at run time. In the spirit of release-early/release-often, I 
attach the first version of this code that runs.

BASIC IDEA
===============

ndarray.std basically does the following (the examples are in pseudo-code, 
even though the actual implementation is in C):

def stddev(A):
	mu = A.mean()
	diff = (A - mu)
	maybe_conj = (diff if not iscomplex(A) else diff.conjugate())
	diff *= maybe_conj
	return sqrt(diff.sum() / A.size)

This allocates a lot of temporary arrays. My code does:

def stddev(A):
	mu = A.mean() # no getting around this temporary
	acc = 0
	for i in xrange(A.size):
		diff = (A[i] - mu)
		if iscomplex(A):
			diff *= conjugate(diff)
		else:
			diff *= diff
		acc += diff
	return sqrt(acc / A.size)
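The accumulation loop above can be checked against numpy directly (a literal Python rendering for illustration; the actual implementation is compiled):

```python
import numpy as np

def stddev_onepass(A, mu):
    """Literal rendering of the accumulation loop: one pass over the
    data after the mean, no array-sized temporaries."""
    acc = 0.0
    flat = A.ravel()
    for i in range(flat.size):
        diff = flat[i] - mu
        if np.iscomplexobj(A):
            diff *= np.conjugate(diff)
        else:
            diff *= diff
        acc += diff
    return np.sqrt(acc / A.size)

A = np.arange(10.0)
result = stddev_onepass(A, A.mean())
```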

Of course, my code does it while taking into account the geometry of the 
array. It handles arrays with arbitrary strides.

I do it while avoiding copying the array at any time (while most of the 
existing code will transpose/copy the array so that it is well behaved in 
memory).


IMPLEMENTATION
===============

I have written some template infrastructure so that, if you wanted a very 
fast entropy calculation on a normalised array, you could do:

template < ... >
void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator results) {
	while (data != past) {
		if (*data) *results += *data * std::log(*data);
		++data;
		++results;
	}
}
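The run-time dispatch idea can be sketched in Python (illustrative names, not the attached code): one generic kernel body, one specialisation per dtype, and a lookup at call time that picks the right one.

```python
import numpy as np

def entropy_kernel(data):
    """Generic kernel body: works for any element type with * and log."""
    acc = 0.0
    for x in data:
        if x:
            acc += x * np.log(x)
    return acc

# The C++ machinery instantiates one optimised version per dtype; a
# Python analogue of picking one at run time is a dispatch table.
DISPATCH = {
    np.dtype(np.float64): entropy_kernel,
    np.dtype(np.float32): entropy_kernel,  # a separate instantiation in C++
}

def fast_entropy(A):
    return DISPATCH[A.dtype](A.ravel())

A = np.array([0.5, 0.25, 0.25])
val = fast_entropy(A)
```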

and the machinery will instantiate this in several variations, deciding at 
run-time which one to call. You just have to write a C interface function 
like

PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds)
{
    int axis=NPY_MAXDIMS;
    PyArray_Descr *dtype=NULL;
    PyArrayObject *out=NULL;
    static char *kwlist[] = {"array","axis", "dtype", "out", NULL};

    if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist,
                                     &self,
                                     PyArray_AxisConverter,&axis,
                                     PyArray_DescrConverter2, &dtype,
                                     PyArray_OutputConverter, &out)) {
        Py_XDECREF(dtype);
        return NULL;
    }

    int num = _get_type_num_double(self->descr, dtype);
    Py_XDECREF(dtype);
    return compress_dispatch<EntropyCompute>(self, out, axis, num, 
EmptyType()); // This decides which function to call
}


For contiguous arrays with axis=None, this becomes

void compute_entropy(Base* data, Base* past, Results* result) {
	while (data != past) {
		if (*data) *result += *data * std::log(*data);
		++data;
	}
}

which is probably as fast as it can be.

If the array is not contiguous, this becomes

void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, 
Results* result) {
	while (data != past) {
		if (*data) *result += *data * std::log(*data);
		++data;
	}
}

where numpy_iterator is a type that knows how to iterate over numpy arrays 
following strides.
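The idea behind numpy_iterator can be mimicked in pure Python (my analogy, not the attached code): given the underlying 1-D storage plus shape and per-axis strides, one can visit every element of a non-contiguous view without copying it.

```python
import numpy as np

def iterate_strided(buf, shape, strides, offset=0):
    """Walk an array stored in `buf` (its 1-D storage) given shape and
    element strides, the way numpy_iterator follows byte strides in C++."""
    if not shape:
        yield buf[offset]
        return
    for i in range(shape[0]):
        yield from iterate_strided(buf, shape[1:], strides[1:],
                                   offset + i * strides[0])

base = np.arange(12.0)            # contiguous storage
A = base.reshape(3, 4).T          # non-contiguous 4x3 view, no copy
elem_strides = tuple(s // A.itemsize for s in A.strides)
vals = list(iterate_strided(base, A.shape, elem_strides))
```

The values come out in the view's logical order even though the storage is laid out for the un-transposed array.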

If axis is not None, then the result is not a single value but an array. The 
machinery automatically creates an array of the right size and passes it to 
you, so that the following gets instantiated:

void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past, 
numpy_iterator<ResultType> results) {
	while (data != past) {
		if (*data) *results += *data * std::log(*data);
		++data;
		++results;
	}
}

The results parameter has the strides set up correctly to iterate over 
results, including looping back when necessary so that the code works as it 
should.

Notice that the ++results operation seems to drop in and out. In fact, I was 
oversimplifying above: there is no instantiation with Result*, but with 
no_iteration<Result>, which behaves like Result* but with an empty 
operator++(). You never change your template implementation.
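A Python analogue of the no_iteration trick (illustrative names, not the attached code): a scalar accumulator exposes the same interface as a real result iterator, so a single kernel body serves both the axis=None case and the per-axis case.

```python
import numpy as np

class NoIteration:
    """Scalar result 'iterator': accumulating works, advancing is a
    no-op -- the analogue of no_iteration<Result> with empty operator++."""
    def __init__(self):
        self.value = 0.0
    def add(self, x):       # plays the role of *result += x
        self.value += x
    def advance(self):      # plays the role of ++result: empty
        pass

class ArrayIteration:
    """Result iterator over an output array, wrapping around as needed."""
    def __init__(self, out):
        self.out = out
        self.i = 0
    def add(self, x):
        self.out[self.i] += x
    def advance(self):
        self.i = (self.i + 1) % len(self.out)

def sum_kernel(data, result):
    # One kernel body, two result policies -- like the C++ template.
    for x in data:
        result.add(x)
        result.advance()

A = np.arange(6.0).reshape(2, 3)
scalar = NoIteration()
sum_kernel(A.ravel(), scalar)           # axis=None: one scalar result
per_col = ArrayIteration(np.zeros(3))
sum_kernel(A.ravel(), per_col)          # axis=0: one result per column
```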

(The code above was for explanatory purposes, not an example of working code. 
The interface I actually use takes more parameters, which are not important 
for expository purposes; this allows you, for example, to implement the ddof 
parameter in std().)

ADVANTAGES
===========

For most operations, my code is faster than numpy (it's hard to beat 
ndarray.sum(), but easy to beat ndarray.std()) on both an Intel 32-bit 
machine and an AMD 64-bit machine, both less than a year old (you can test it 
on your machine by running profile.py). For some specific operations, like 
summing along a non-last axis of a moderately large array, it is slower (I 
think the copy-and-process tactic may beat the no-copy/one-pass approach I am 
using in this case). In most cases I tested, it is faster; in particular, it 
is faster in the basic case of a well-behaved array.

More important than speed (at least for me) is the fact that this does not 
allocate more memory than needed. This will not fail with OOM errors.

It's easy to implement specific optimisations. For example, replace the sum 
function for a specific case with a call to AMD's Framewave library (which 
uses SIMD instructions):

void compute_sum(short* data, short* past, no_iteration<short> result) {
	fwiSum_16s_C1R(data, sizeof(short), past - data, &*result);
}

or compute the standard deviation of an array of booleans in a single pass 
(sigma = sqrt(p(1-p))):

void compute_std(bool* data, bool* past, no_iteration<ResType> result) {
    size_type N = (past-data);
    size_type pos = 0;
    while (data != past) {
        if (*data) ++pos;
        ++data;
    }
    *result = std::sqrt(ResType(pos)/ResType(N)*(1-ResType(pos)/ResType(N)));
}
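The boolean identity is easy to verify at the Python level (a sketch, not the attached code): for 0/1 data the standard deviation is determined entirely by the fraction p of True entries.

```python
import numpy as np

def bool_std(A):
    """std of a boolean array in one pass: sigma = sqrt(p*(1-p)),
    where p is the fraction of True entries."""
    A = A.ravel()
    p = np.count_nonzero(A) / A.size
    return np.sqrt(p * (1.0 - p))

A = np.array([True, False, True, True, False, False, True, False])
sigma = bool_std(A)
```

Here p = 1/2, so sigma = 1/2, agreeing with the generic A.std().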

NOT IMPLEMENTED (YET)
======================

* Non-aligned arrays are not handled, and out arrays have to be well behaved. 
My current idea is to compromise and make copies in these cases. I could, 
trivially, write code that handles them without copying, but I don't think 
it's worth the cost in code bloat.

* argmax()/argmin(). These are a bit harder to implement than the rest, as 
they need a bit more machinery. I think it's possible, but I haven't gotten 
around to it.

* ptp(). I am hesitating over whether to simply do it in Python:

def ncr_ptp(A,axis=None,dtype=None,out=None):
    res=ncr.max(A,axis=axis,dtype=dtype,out=out)
    min=ncr.min(A,axis=axis,dtype=dtype)
    res -= min
    return res

This does two passes over the data, but no extra copying. I could do the same 
using the array API from C, of course.
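The same two-pass scheme works with numpy's own reductions standing in for the ncr functions (a sketch; ncr itself is in the attached package):

```python
import numpy as np

def two_pass_ptp(A, axis=None):
    """ptp as two reductions plus an in-place subtraction: two passes
    over the data, but no array-sized temporaries beyond the result."""
    res = np.asarray(A.max(axis=axis))   # asarray so in-place -= works
    res -= A.min(axis=axis)
    return res

A = np.arange(12.0).reshape(3, 4)
full = two_pass_ptp(A)              # peak-to-peak over the whole array
rows = two_pass_ptp(A, axis=1)      # peak-to-peak per row
```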

DISADVANTAGES
==============

It's C++. I don't see this as a technical disadvantage, but I understand some 
people might not like it. If this were used in the numpy code base, it could 
replace the current macro language (begin repeat // end repeat), so the total 
number of languages used would not increase.

It generates too many functions. Disk is cheap, but do we really need a 
well-optimised version of std() for the case of complex inputs and boolean 
output? What's a boolean standard deviation, anyway? Maybe some tradeoff is 
necessary: optimise the defaults and make the others possible.

BUGS
=====

I am not correctly implementing the dtype parameter. I thought it controlled 
the type of the intermediate results, but if I do

import numpy
A=numpy.arange(10)
A.mean(dtype=numpy.int8)

I get

4.5

which surprises me!

Furthermore,

A.mean(dtype=numpy.int8).dtype

returns

dtype('float64')

In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4

(This might be considered a numpy bug --- BTW, I am not running subversion 
for this comparison.)

* I also return only either Python longs or Python floats. This is a trivial 
change; I just don't know all the right function names to return numpy types.

* A couple of places in the code still have a FIXME on them.

FUTURE
=======

I consider this code proof-of-concept. It's cool and demonstrates that the 
idea works, but it is rough and needs cleaning up. There might even be bugs 
in it (especially in the C interface with Python, which I am not so familiar 
with)!

I see three possible paths for this:
	(1) You agree that this is nice and achieves good results, and I should 
strive to integrate it into numpy proper, replacing the current 
implementation. I would start a branch in numpy svn, implement it there, and 
finally merge into the main branch. I would be perfectly happy to relicense 
to BSD in this case.
	One could imagine the array-to-array operations (A+B, A+=2, A != B, ...) 
using a similar scheme.
	This would imply that part of the C API would be obsoleted: the whole 
functions array would stop making sense. I don't know how widely it is used.

	(2) I keep maintaining this as a separate package and whoever wants to use 
it can use it. I will probably keep it GPL for now.
	Of course, at some later point, it could be integrated into the main branch 
(again, I am happy to relicense in that case).

	(3) Someone takes this as a challenge and re-implements ndarray.std and the 
like so that they use less memory and run faster, but still in C. I am not 
much of a C programmer, so I don't see how this could be done without really 
ugly reliance on the preprocessor, but maybe it can.

What do you think?

bye,
Luís Pedro Coelho
PhD Student in Computational Biology
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ncreduce-0.01.tar.gz
Type: application/x-tgz
Size: 5542 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080726/165d7817/attachment-0002.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: profile.py
Type: application/x-python
Size: 4728 bytes
Desc: not available
Url : http://projects.scipy.org/pipermail/numpy-discussion/attachments/20080726/165d7817/attachment-0003.bin 
-------------- next part --------------
Values are fold improvements: >1 means ncreduce is faster, <1 means ncreduce is slower
Columns are different array sizes: (4x4) (40x4) (4000x10) (10x4000) (4000x10000)
Rows are different types of reduce operation:
 A.f()
 A.T.f()
 A.f(0)
 A.f(1)
 A.T.f(1)
 A[2].f()
 A[:,2].f()


For SUM function
[[  3.8070922    3.74078624   3.18315868   3.21160784   2.43980108]
 [  4.46617647   4.22613065   8.31738401   6.8606588   29.21819473]
 [  1.9447132    1.56371814   0.96617167   3.01338318   7.0270556 ]
 [  2.51067616   4.4812749    6.82775021   2.81078355   2.02944134]
 [  1.91545392   1.62806468   0.87334081   2.54212652   0.80367664]
 [  3.6557377    3.52655539   3.45639535   3.09458466   3.09918836]
 [  3.20726783   2.93624557   1.0432128    3.20680628   0.96305311]]

For MEAN function
[[  4.53997379   4.33751425   3.18691456   3.21325964   2.61187721]
 [  5.09466667   4.86712486   8.09895611   6.66878963  29.56070697]
 [  1.32862454   1.2232241    0.95682217   2.77428247   6.59373701]
 [  1.38306992   2.07929176   5.6042247    2.67118025   2.01389827]
 [  1.32216749   1.24022472   0.8609297    2.38601915   0.79781743]
 [  4.26244953   4.21212121   4.28808864   3.22779923   3.19200338]
 [  4.05635649   3.68317972   1.08496276   4.00253807   0.95483616]]

For STD function
[[ 16.37931034  13.23200859   8.515398     8.6650662    5.56770624]
 [ 17.59516616  13.92785794  11.48111037  10.94399828  16.72880933]
 [  7.96362619   4.69929338   2.06533419   2.63442178   5.35333711]
 [ 12.23589744  12.1507772    8.82073033   8.74585517   6.89534047]
 [  7.9735658    4.66921665   1.54624704   2.49588909   1.30503195]
 [ 17.3400936   17.64123377  17.10280374   5.49334051   4.91767663]
 [ 14.07474227  10.96748768   1.73907847  13.80691299   1.08550626]]

For VAR function
[[ 12.75496689  11.19303797   8.23840394   8.17942839   5.35515979]
 [ 14.72340426  11.67340426  11.47882782  11.10354571  16.79689987]
 [  7.43418014   4.3738514    2.07330643   2.68918206   5.09448361]
 [ 11.42946429  13.17880795   9.98134987   8.36491556   6.79879305]
 [  7.52822581   4.28857407   1.4730624    2.47868726   1.1964753 ]
 [ 14.93135725  14.64387917  14.28         5.2311021    4.79074167]
 [ 12.18278146   9.37636544   1.65813694  11.40481928   1.04396634]]

For MAX function
[[  4.18282548   4.1835206    3.064325     3.09957884   1.89579732]
 [  5.3807267    5.09299191   8.16320492   6.81480951  23.32648022]
 [  2.22292994   1.53729977   0.82438672   1.95009491   6.05179257]
 [  2.95664467   4.09146341   3.55795579   2.05772538   1.74708854]
 [  2.35276753   1.66998507   0.86066282   1.75172327   0.80780005]
 [  4.53484603   4.39833333   3.88140556   3.10858712   3.08567761]
 [  4.31647399   3.50316857   1.21125174   3.87866109   0.93919081]]

For MIN function
[[  5.12647059   4.29441624   3.03089161   3.05712249   1.89781197]
 [  5.23816794   4.79036458   8.18441573   6.75257879  23.33387201]
 [  2.20617111   1.57129757   0.86588154   2.0558663    6.3874678 ]
 [  2.96841155   3.84391192   3.35497735   1.99594036   1.74332781]
 [  2.25244073   1.69904857   0.87264242   1.82947976   0.80815663]
 [  4.62801932   4.39569536   4.34477124   3.12006903   3.10487051]
 [  3.97965116   3.51935081   1.16092389   4.00438596   0.93541891]]

For ALL function
[[  2.79002467e-02   3.76257310e+00   9.28888889e+01   8.84859813e+01
    8.42846792e+04]
 [  4.99675325e+00   5.28160000e+00   1.31449298e+02   1.89387255e+02
    5.50127171e+05]
 [  2.30509745e+00   1.61616675e+00   3.98208860e-01   2.52603138e+00
    2.41908320e+00]
 [  2.82739726e+00   4.85934820e+00   3.36824142e+01   5.78263158e+01
    1.46416294e+03]
 [  1.95994914e+00   1.75441501e+00   4.47726973e-01   2.20169301e+00
    9.43527633e-01]
 [  4.18080000e+00   4.11092715e+00   4.09602649e+00   1.30212072e+01
    2.61726974e+01]
 [  3.84461538e+00   3.96411856e+00   1.29984496e+01   3.94307692e+00
    5.45761689e+01]]

For ANY function
[[  4.54672897e+00   4.97651007e+00   9.48595944e+01   9.28051948e+01
    8.83522646e+04]
 [  5.02138158e+00   5.54304636e+00   1.40057190e+02   1.87542130e+02
    5.78053539e+05]
 [  2.42564910e+00   1.70693277e+00   4.13649990e-01   2.52723749e+00
    2.52861281e+00]
 [  2.97480620e+00   5.02946429e+00   3.39985881e+01   5.69674115e+01
    1.50428346e+03]
 [  2.29548872e+00   1.81257078e+00   4.52254796e-01   2.20091677e+00
    9.46904281e-01]
 [  4.29111842e+00   4.16582064e+00   4.13087248e+00   1.33412162e+01
    2.89243697e+01]
 [  3.91051805e+00   3.96214511e+00   1.41828299e+01   3.93322734e+00
    6.09200000e+01]]

python profile.py 10948.30s user 625.08s system 99% cpu 3:13:06.33 total
-------------- next part --------------
Values are fold improvements: >1 means ncreduce is faster, <1 means ncreduce is slower
Columns are different array sizes: (4x4) (40x4) (4000x10) (10x4000) (4000x10000)
Rows are different types of reduce operation:
 A.f()
 A.T.f()
 A.f(0)
 A.f(1)
 A.T.f(1)
 A[2].f()
 A[:,2].f()


For SUM function
[[  4.29501085   4.20192308   2.84235939   2.88138342   2.29455524]
 [  5.25827815   5.16287879   6.16807616   6.3403272   14.58060969]
 [  2.21042831   1.50277949   0.44599368   2.76584319   1.32321157]
 [  3.03638645   5.85714286  17.28581874   2.82322888   2.37914623]
 [  2.10303588   1.59593975   0.43697892   2.48825259   0.7626602 ]
 [  4.18485523   4.01573034   4.12837838   3.14817601   3.04448643]
 [  3.71543086   3.2927242    0.67116867   3.5546875    1.12240952]]

For MEAN function
[[  5.30393996   5.08747856   3.07230251   3.05563231   2.32033496]
 [  6.37074148   5.9165247    6.12403101   6.28776978  13.90174399]
 [  1.21764063   1.07815771   0.44468226   2.46024723   1.34025926]
 [  1.33383189   2.1530469    8.07927117   2.73193251   2.42263582]
 [  1.19756428   1.09569715   0.44064895   2.26017632   0.74816577]
 [  5.19438878   5.11608961   5.05210421   3.33246946   3.13600136]
 [  4.45698925   4.14195584   0.73804391   4.40414508   1.16802168]]

For STD function
[[ 17.99813433  16.37692308   7.62219466   7.631936     7.28233251]
 [ 21.27087576  17.59594384   9.22584327   9.31559985  12.67600126]
 [  8.86459803   5.83032235   1.86635194   2.37492871   1.82752484]
 [ 14.94068802  12.8956229    8.47783179   7.37456519   7.95656419]
 [  9.27318841   5.22204908   0.93005222   2.35775864   1.21045558]
 [ 19.32653061  19.59915612  18.70140281   6.42620232   5.58600269]
 [ 15.66890756  12.18216561   1.66883246  14.60942249   2.57191279]]

For VAR function
[[ 16.59753593  14.16326531   7.90385245   7.05429577   7.07134674]
 [ 18.39917695  15.50394945   9.25239751   9.21568903  12.79580458]
 [  8.60134128   5.53321364   1.84634827   2.51325626   1.83403096]
 [ 14.28641975  16.80358535  11.29057751   7.46716752   8.01444069]
 [  8.5099926    4.92359729   0.91359572   2.5071937    1.26863978]
 [ 17.28387097  17.0867679   16.32217573   6.15619088   5.4340506 ]
 [ 13.72299652  10.93495935   1.58346303  12.90923825   1.56830078]]

For MAX function
[[  5.20682303   4.9273743    2.65534406   2.64120879   1.972041  ]
 [  5.89339019   5.66117216   5.76669586   5.91858423  12.36736688]
 [  2.62611276   1.8565051    0.51823534   2.79180066   1.45855696]
 [  3.61738003   5.61111111   4.89628282   2.00837979   1.78166991]
 [  2.70691334   1.95046235   0.4998023    2.43290857   0.82916059]
 [  5.16556291   5.08944954   5.13443396   3.00861101   2.80622541]
 [  4.6120332    4.03327496   0.85167674   4.32677165   1.14838906]]

For MIN function
[[  5.14893617   4.74499089   2.75640497   2.71304858   2.0904718 ]
 [  6.03862661   5.56284153   5.65100334   5.751155    12.2645859 ]
 [  2.72997033   1.83545918   0.51301053   2.79813543   1.49883452]
 [  3.65349144   5.54545455   4.87626525   1.9751425    1.77834036]
 [  2.62728146   1.93559097   0.4968886    2.43326653   0.84075323]
 [  5.07658643   4.91314031   4.78118162   2.90317812   2.69502462]
 [  4.36811024   3.71568627   0.82321831   4.01663586   1.15417859]]

For ALL function
[[  3.01151344e-02   6.80778589e+00   1.61822384e+02   1.61588808e+02
    1.36934336e+05]
 [  6.09692671e+00   7.12165450e+00   2.25415909e+02   3.05729167e+02
    3.79104862e+05]
 [  2.56551724e+00   1.85556995e+00   4.66932320e-01   3.53097573e+00
    8.77598466e-01]
 [  3.47936085e+00   8.33626098e+00   9.45616438e+01   1.55502577e+02
    5.36139550e+03]
 [  2.52413793e+00   1.99930314e+00   4.82473875e-01   3.33456449e+00
    5.92418062e-01]
 [  4.86363636e+00   4.91084337e+00   4.89156627e+00   1.95740741e+01
    3.94606987e+01]
 [  4.11328125e+00   4.32681018e+00   1.75753968e+01   4.15145228e+00
    2.69920477e+01]]

For ANY function
[[  4.93668122e+00   5.90997567e+00   1.68126551e+02   1.63245783e+02
    1.42517746e+05]
 [  6.18491484e+00   6.83490566e+00   2.25388764e+02   3.13054374e+02
    3.77207517e+05]
 [  2.62072435e+00   1.79820051e+00   4.81102465e-01   3.57245295e+00
    8.86472009e-01]
 [  3.57994580e+00   7.91734417e+00   7.90528971e+01   9.62805049e+01
    5.50243880e+03]
 [  2.59674134e+00   1.95604396e+00   4.90362418e-01   3.38196965e+00
    5.94254604e-01]
 [  4.94285714e+00   4.84671533e+00   4.83132530e+00   2.07274939e+01
    4.46569343e+01]
 [  4.04887984e+00   4.33544304e+00   1.83661088e+01   4.14978903e+00
    2.76757322e+01]]

3199.12s user 552.47s system 99% cpu 1:02:38.99 total

