[Numpy-discussion] No Copy Reduce Operations
Bruce Southey
bsouthey@gmail....
Mon Jul 28 09:31:49 CDT 2008
Luis Pedro Coelho wrote:
> Hello all,
>
> Numpy arrays come with several reduce operations: sum(), std(), argmin(),
> min(), ....
>
> The traditional implementation of these suffers from two big problems: It is
> slow and it often allocates intermediate memory. I have code that is failing
> with OOM (out of memory) exceptions in calls to ndarray.std(). I regularly
> handle arrays with 100 million entries (have a couple of million objects * 20
> features per object = 100 million doubles), so this is a real problem for me.
>
> This being open-source, I decided to solve the problem. My first idea was to
> try to improve the numpy code. I failed to see how to do that while
> supporting everything that numpy does (multiple types, for example), so I
> started an implementation of reduce operations that uses C++ templates to
> make code optimised into the types it actually uses, choosing the right
> version to use at run time. In the spirit or release-early/release-often, I
> attach the first version of this code that runs.
>
> BASIC IDEA
> ===============
>
> ndarray.std does basically the following (The examples are in pseudo-code even
> though the implementation happens to be in C):
>
> def stddev(A):
> mu = A.mean()
> diff=(A-mu)
> maybe_conj=(diff if not complex(A) else diff.conjugate())
> diff *= maybe_conj
> return diff.sum()
>
> With a lot of temporary arrays. My code does:
>
> def stddev(A):
> mu = A.mean() # No getting around this temporary
> std = 0
> for i in xrange(A.size):
> diff = (A[i]-mu)
> if complex(A):
> diff *= conjugate(diff)
> else:
> diff *= diff
> std += diff
> return sqrt(diff/A.size)
>
> Of course, my code does it while taking into account the geometry of the
> array. It handles arrays with arbitrary strides.
>
> I do it while avoiding copying the array at any time (while most of the
> existing code will transpose/copy the array so that it is well behaved in
> memory).
>
>
> IMPLEMENTATION
> ===============
>
> I have written some template infrastructure so that, if I wanted a very fast
> entropy calculation, on a normalised array, you could do:
>
> template < ... >
> void compute_entropy(BaseIterator data, BaseIterator past, ResultsIterator
> results) {
> while (data != past) {
> if (*data) *result += *data * std::log(*data);
> ++data;
> ++result;
> }
> }
>
> and the machinery will instantiate this in several variations, deciding at
> run-time which one to call. You just have to write a C interface function
> like
>
> PyObject* fast_entropy(PyArrayObject *self, PyObject *args, PyObject *kwds)
> {
> int axis=NPY_MAXDIMS;
> PyArray_Descr *dtype=NULL;
> PyArrayObject *out=NULL;
> static char *kwlist[] = {"array","axis", "dtype", "out", NULL};
>
> if (!PyArg_ParseTupleAndKeywords(args, kwds, "O|O&O&O&", kwlist,
> &self,
> PyArray_AxisConverter,&axis,
> PyArray_DescrConverter2, &dtype,
> PyArray_OutputConverter, &out)) {
> Py_XDECREF(dtype);
> return NULL;
> }
>
> int num = _get_type_num_double(self->descr, dtype);
> Py_XDECREF(dtype);
> return compress_dispatch<EntropyCompute>(self, out, axis, num,
> EmptyType()); // This decides which function to call
> }
>
>
> For contiguous arrays, axis=None, this becomes
>
> void compute_entropy(Base* data, Base* past, Results* result) {
> while (data != past) {
> if (*data) *result += *data * std::log(*data);
> ++data;
> }
> }
>
> which is probably as fast as it can be.
>
> If the array is not contiguous, this becomes
>
> void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past,
> Results* result) {
> while (data != past) {
> if (*data) *result += *data * std::log(*data);
> ++data;
> }
> }
>
> where numpy_iterator is a type that knows how to iterate over numpy arrays
> following strides.
>
> If axis is not None, then the result will not be a single value, it will be an
> array. The machinery will automatically create the array of the right size
> and pass it to you with so that the following gets instantiated:
>
> void compute_entropy(numpy_iterator<Base> data, numpy_iterator<Base> past,
> numpy_iterator<ResultType> results) {
> while (data != past) {
> if (*data) *results += *data * std::log(*data);
> ++data;
> ++results;
> }
> }
>
> The results parameter has the strides set up correctly to iterate over
> results, including looping back when necessary so that the code works as it
> should.
>
> Notice that the ++results operation seems to be dropping in and out. In fact,
> I was oversimplifying above. There is no instantiation with Result*, but with
> no_iteration<Result> which behaves like Result*, but with an empty operator
> ++(). You never change your template implementation.
>
> (The code above was for explanatory purposes, not an example of working code.
> The interface I actually used takes more parameters which are not very
> important for expository purposes. This allows you to, for example, implement
> the ddof parameter in std()).
>
> ADVANTAGES
> ===========
>
> For most operations, my code is faster (it's hard to beat ndarray.sum(), but
> easy to beat ndarray.std()) than numpy on both an intel 32 bit machine and an
> amd 64 bit machine both newer than one year (you can test it on your machine
> by runnning profile.py). For some specific operations, like summing along a
> non-last axis on a moderately large array, it is slower (I think that the
> copy&process tactic might be better in this case than the no-copy/one pass
> operation I am using). In most cases I tested, it is faster. In particular,
> the basic case (a well-behaved array), it is faster.
>
> More important than speed (at least for me) is the fact that this does not
> allocate more memory than needed. This will not fail with OOM errors.
>
> It's easy to implement specific optimisations. For example, replace a sum
> function for a specific case to call AMD's framewave SIMD library (which uses
> SIMD instructions):
>
> void compute_sum(short* data, short* past, no_iteration<short> result) {
> fwiSum_16s_C1R ( data, sizeof(short), past-start, &*result);
> }
>
> or, compute the standard deviation of an array of boolean with a single pass
> (sigma = sqrt(p(1-p))):
>
> void compute_std(bool* data, bool* past, no_iteration<ResType> result) {
> size_type N = (past-data);
> size_type pos = 0;
> while (data != past) {
> if (*data) ++pos;
> ++data;
> }
> *result = std::sqrt(ResType(pos)/ResType(N)*(1-ResType(pos)/ResType(N)));
> }
>
> NOT IMPLEMENTED (YET)
> ======================
>
> Non-aligned arrays are not implemented.
> out arrays have to be well behaved.
>
> My current idea is to compromise and make copies in this case. I could also,
> trivially, write a thing that handled those cases without copying, but I
> don't think it's worth the cost in code bloat.
>
> * argmax()/argmin(). This is a bit harder to implement than the rest, as it
> needs a bit more machinery. I think it's possible, but I haven't gotten
> around to it.
>
> * ptp(). I hesitate whether to simply do it in Python:
>
> def ncr_ptp(A,axis=None,dtype=None,out=None):
> res=ncr.max(A,axis=axis,dtype=dtype,out=out)
> min=ncr.min(A,axis=axis,dtype=dtype)
> res -= max
> return res
>
> Does two passes over the data, but no extra copying. I could do the same using
> the Python array API, of course.
>
> DISADVANTAGES
> ==============
>
> It's C++. I don't see it as a technical disadvantage, but I understand some
> people might not like it. If this was used in the numpy code base, it could
> replace the current macro language (begin repeat // end repeat), so the total
> number of languages used would not increase.
>
> It generates too many functions. Disk is cheap, but do we really need a well
> optimised version of std() for the case of complex inputs and boolean output?
> What's a boolean standard deviation anyway?. Maybe some tradeoff is
> necessary: optimise the defaults and make others possible.
>
> BUGS
> =====
>
> I am not correctly implementing the dtype parameter. I thought it controlled
> the type of the intermediate results, but if I do
>
> import numpy
> A=numpy.arange(10)
> A.mean(dtype=numpy.int8)
>
> I get
>
> 4.5
>
> which surprises me!
>
> Furthermore,
>
> A.mean(dtype=numpy.int8).dtype
>
> returns
>
> dtype('float64')
>
> In my implementation, mean(numpy.arange(10),dtype=numpy.int8) is 4
>
> (This might be considered a numpy bug --- BTW, i am not running subversion for
> this comparison).
>
> * I also return only either python longs or python floats. This is a trivial
> change, I just don't know all the right function names to return numpy types.
>
> * A couple of places in the code still have a FIXME on them
>
> FUTURE
> =======
>
> I consider this code proof-of-concept. It's cool, it demonstrate that the idea
> works, but it is rough and needs cleaning up. There might even be bugs in it
> (especially the C interface with Python is something I am not so familiar
> with)!
>
> I see three possible paths for this:
> (1) You agree that this is nice, it achieves good results and I should strive
> to integrate this into numpy proper, replacing the current implementation. I
> would start a branch in numpy svn and implement it there and finally merge
> into the main branch. I would be perfectly happy to relicense to BSD if this
> is the case.
> One could think of the array-to-array (A+B, A+=2, A != B,...) operations
> using a similar scheme
> This would imply that part of the C API would be obsoleted: the whole
> functions array would stop making sense. I don't know how widespread it's
> used.
>
> (2) I keep maintaining this as a separate package and whoever wants to use
> it, can use it. I will probably keep it GPL, for now.
> Of course, at some later point in time, one could integrate this into the
> main branch (Again, in this case, I am happy to relicense).
>
> (3) Someone decides to take this as a challenge and re-implements ndarray.std
> and the like so that it uses less memory and is faster, but does it still in
> C. I am not much of a C programmer, so I don't see how this could be done
> without really ugly reliance on the preprocessor, but maybe it could be done.
>
> What do you think?
>
> bye,
> Luís Pedro Coelho
> PhD Student in Computational Biology
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
FYI for computing variance (and hence std) you probably should be using
Knuth's (or Welford's) one pass approach (On-line algorithm) to avoid
recomputing the mean:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
Also for large arrays, you may also want to maximize the precision to
avoid potential overflow.
Bruce
More information about the Numpy-discussion
mailing list