[Numpy-discussion] Speeding up Numeric
Francesc Altet
faltet at carabos.com
Fri Jan 28 14:30:54 CST 2005
Hi Todd,
Nice to see that you achieved a good speed-up with your
optimization patch. With the following code:
import numarray
a = numarray.arange(2000)
a.shape=(1000,2)
for j in xrange(1000):
    for i in range(len(a)):
        row=a[i]
and original numarray-1.1.1 it took 11.254s (pentium4 at 2GHz). With your
patch, this time has been reduced to 7.816s. Now, following your
suggestion to push NumArray.__del__ down into C, I've got a good
speed-up as well: 5.332s. This is more than twice as fast as the
unpatched numarray 1.1.1. There is still a long way to go before we
catch up with Numeric (1.123s), but it is a first step :)
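For reference, the ratios implied by these timings can be worked out with a few lines of Python. This is only arithmetic on the four numbers quoted above; the dictionary labels and the ratios() helper are names made up for this sketch, and it is written for a current Python rather than the 2.4 used in the benchmark.

```python
# Wall-clock timings (seconds) quoted above for the row-indexing benchmark.
timings = {
    "numarray 1.1.1 (unpatched)": 11.254,
    "numarray + view() patch": 7.816,
    "numarray + view() patch + C dealloc": 5.332,
    "Numeric": 1.123,
}

baseline = timings["numarray 1.1.1 (unpatched)"]
numeric = timings["Numeric"]


def ratios(seconds):
    """Return (speed-up over unpatched numarray, multiple of Numeric's time)."""
    return baseline / seconds, seconds / numeric


for label, seconds in timings.items():
    speedup, vs_numeric = ratios(seconds)
    print("%-38s %7.3fs  %.2fx faster than unpatched, %.2fx Numeric's time"
          % (label, seconds, speedup, vs_numeric))
```

So the two patches together give a bit more than a 2x speed-up, and the patched numarray still needs between four and five times as long as Numeric on this loop.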
Here is the patch. Please review it, as I'm not very used to dealing
with pure C extensions (I'm just a Pyrex user):
Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
< def __del__(self):
< if self._shadows != None:
< self._shadows._copyFrom(self)
< self._shadows = None
---
> def __del__(self):
> if self._shadows != None:
> self._shadows._copyFrom(self)
> self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void
> _numarray_dealloc(PyObject *self)
> {
>     PyArrayObject *selfa = (PyArrayObject *) self;
>
>     if (selfa->_shadows != NULL) {
>         _copyFrom(selfa->_shadows, self);
>         selfa->_shadows = NULL;
>     }
>     self->ob_type->tp_free(self);
> }
>
421c433
< 0, /* tp_dealloc */
---
> _numarray_dealloc, /* tp_dealloc */
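To get a rough feeling for why a Python-level __del__ hurts in this kind of loop, the following sketch compares creating and dropping many small objects with and without such a method. It is only an illustration in pure Python: Plain, WithDel and churn are hypothetical names, no numarray code is involved, and the absolute numbers will differ from machine to machine.

```python
import timeit


class Plain(object):
    """Stand-in for an array object with no Python-level finalizer."""

    def __init__(self):
        self._shadows = None


class WithDel(Plain):
    """Same object plus a Python-level __del__, as in the old numarraycore.py."""

    def __del__(self):
        # The real method also calls self._shadows._copyFrom(self);
        # that is left out here because _shadows is always None.
        if self._shadows is not None:
            self._shadows = None


def churn(cls, n=100000):
    """Create and immediately drop n instances, like row=a[i] in the loop."""
    made = 0
    for _ in range(n):
        obj = cls()  # rebinding obj releases the previous instance
        made += 1
    return made


for cls in (Plain, WithDel):
    seconds = timeit.timeit(lambda: churn(cls), number=3)
    print("%-8s %.3fs" % (cls.__name__, seconds))
```

The only difference between the two classes is the extra Python function call made each time an instance is destroyed, which is the per-object cost the C dealloc above is meant to avoid.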
The profile with the new optimizations now looks like this:
samples  %        image name        symbol name
453      8.6319   python            PyEval_EvalFrame
372      7.0884   python            lookdict_string
349      6.6502   python            string_hash
271      5.1639   libc-2.3.2.so     _wordcopy_bwd_aligned
210      4.0015   libnumarray.so    NA_updateStatus
194      3.6966   python            _PyString_Eq
185      3.5252   libc-2.3.2.so     __GI___strcasecmp
162      3.0869   python            subtype_dealloc
158      3.0107   libc-2.3.2.so     _int_malloc
147      2.8011   libnumarray.so    isBufferWriteable
145      2.7630   python            PyDict_SetItem
135      2.5724   _ndarray.so       _view
131      2.4962   python            PyObject_GenericGetAttr
122      2.3247   python            PyDict_GetItem
100      1.9055   python            PyString_InternInPlace
94       1.7912   libnumarray.so    getReadBufferDataPtr
77       1.4672   _ndarray.so       _simpleIndexingCore
i.e. the time spent in libc and libnumarray is moving up the list, as
it should. Now we have to concentrate on other points of optimization.
Perhaps it is a good time to try recompiling the kernel and
getting the call tree...
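In the meantime, a caller/callee view of the Python-level part can be had from the standard library alone. The sketch below assumes a current Python with cProfile and pstats; Row, make_row and benchmark are hypothetical stand-ins for the array objects and the benchmark loop. It sees only Python frames, so it is no substitute for an oprofile call tree covering libnumarray and libc.

```python
import cProfile
import io
import pstats


class Row(object):
    """Hypothetical stand-in for a sub-array with a Python-level __del__."""

    def __del__(self):
        pass


def make_row():
    return Row()


def benchmark(n=1000):
    for _ in range(n):
        row = make_row()


def profile_callers(func):
    """Profile func and return the 'who called whom' report as a string."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("tottime")
    stats.print_callers()
    return out.getvalue()


report = profile_callers(benchmark)
print(report)
```

Each profiled function is listed together with the functions that called it, which is the Python-side analogue of the call tree I am after.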
Cheers,
On Friday 28 January 2005 12:48, Todd Miller wrote:
> I got some insight into what I think is the tall pole in the profile:
> sub-array creation is implemented using views. The generic indexing
> code does a view() Python callback because object arrays override
> view(). Faster view() creation for numerical arrays can be achieved
> like this by avoiding the callback:
>
> Index: Src/_ndarraymodule.c
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
> retrieving revision 1.75
> diff -c -r1.75 _ndarraymodule.c
> *** Src/_ndarraymodule.c 14 Jan 2005 14:13:22 -0000 1.75
> --- Src/_ndarraymodule.c 28 Jan 2005 11:15:50 -0000
> ***************
> *** 453,460 ****
> }
> } else { /* partially subscripted --> subarray */
> long i;
> ! result = (PyArrayObject *)
> ! PyObject_CallMethod((PyObject *) self,"view",NULL);
> if (!result) goto _exit;
>
> result->nd = result->nstrides = self->nd - nindices;
> --- 453,463 ----
> }
> } else { /* partially subscripted --> subarray */
> long i;
> ! if (NA_NumArrayCheck((PyObject *)self))
> ! result = _view(self);
> ! else
> ! result = (PyArrayObject *) PyObject_CallMethod(
> ! (PyObject *) self,"view",NULL);
> if (!result) goto _exit;
>
> result->nd = result->nstrides = self->nd - nindices;
>
> I committed the patch above to CVS for now. This optimization makes
> view() "non-overridable" for NumArray subclasses so there is probably a
> better way of doing this.
>
> One other thing that struck me looking at your profile, and it has been
> discussed before, is that NumArray.__del__() needs to be pushed (back)
> down into C. Getting rid of __del__ would also synergize well with
> making an object freelist, one aspect of which is capturing unneeded
> objects rather than destroying them.
>
> Thanks for the profile.
>
> Regards,
> Todd
>
> On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
> > Hi,
> >
> > After a while of waiting for some free time, I've been playing with
> > the excellent oprofile to try to help reduce numarray creation time.
> >
> > For that goal, I selected the following small benchmark:
> >
> > import numarray
> > a = numarray.arange(2000)
> > a.shape=(1000,2)
> > for j in xrange(1000):
> >     for i in range(len(a)):
> >         row=a[i]
> >
> > I know that it mixes creation and indexing costs, but since indexing
> > in numarray is only a bit slower (perhaps 40%) than in Numeric,
> > while array creation is 5 to 10 times slower, I think this
> > benchmark may provide a good starting point to see what's going on.
> >
> > For numarray, I've got the following results:
> >
> > samples  %        image name          symbol name
> > 902      7.3238   python              PyEval_EvalFrame
> > 835      6.7798   python              lookdict_string
> > 408      3.3128   python              PyObject_GenericGetAttr
> > 384      3.1179   python              PyDict_GetItem
> > 383      3.1098   libc-2.3.2.so       memcpy
> > 358      2.9068   libpthread-0.10.so  __pthread_alt_unlock
> > 293      2.3790   python              _PyString_Eq
> > 273      2.2166   libnumarray.so      NA_updateStatus
> > 273      2.2166   python              PyType_IsSubtype
> > 271      2.2004   python              countformat
> > 252      2.0461   libc-2.3.2.so       memset
> > 249      2.0218   python              string_hash
> > 248      2.0136   _ndarray.so         _universalIndexing
> >
> > while for Numeric I've got this:
> >
> > samples  %        image name          symbol name
> > 279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
> > 216      12.1144  libc-2.3.2.so       memmove
> > 187      10.4879  python              lookdict_string
> > 162      9.0858   python              PyEval_EvalFrame
> > 144      8.0763   libpthread-0.10.so  __pthread_alt_lock
> > 126      7.0667   libpthread-0.10.so  __pthread_alt_trylock
> > 56       3.1408   python              PyDict_SetItem
> > 53       2.9725   libpthread-0.10.so  __GI___pthread_mutex_unlock
> > 45       2.5238   _numpy.so           PyArray_FromDimsAndDataAndDescr
> > 39       2.1873   libc-2.3.2.so       __malloc
> > 36       2.0191   libc-2.3.2.so       __cfree
> >
> > One preliminary result is that numarray spends a lot more time in
> > Python space than Numeric does, as Todd already said here. The
> > problem is that, as I have not yet patched my kernel, I can't get the
> > call tree, so I can't track down what is ultimately responsible.
> >
> > So, I've tried running the profile module included in the standard
> > library in order to see where the hot spots are in Python:
> >
> > $ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
> > 1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
> >
> > Ordered by: internal time
> >
> >    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
> >         1   19.220   19.220   25.290   25.290 create-numarray.py:1(?)
> >    999999    5.530    0.000    5.530    0.000 numarraycore.py:514(__del__)
> >      1753    0.160    0.000    0.160    0.000 :0(eval)
> >         1    0.060    0.060    0.340    0.340 numarraycore.py:3(?)
> >         1    0.050    0.050    0.390    0.390 generic.py:8(?)
> >         1    0.040    0.040    0.490    0.490 numarrayall.py:1(?)
> >      3455    0.040    0.000    0.040    0.000 :0(len)
> >         1    0.030    0.030    0.190    0.190 ufunc.py:1504(_makeCUFuncDict)
> >        51    0.030    0.001    0.070    0.001 ufunc.py:184(_nIOArgs)
> >      3572    0.030    0.000    0.030    0.000 :0(has_key)
> >      2582    0.020    0.000    0.020    0.000 :0(append)
> >      1000    0.020    0.000    0.020    0.000 :0(range)
> >         1    0.010    0.010    0.010    0.010 generic.py:510(_stridesFromShape)
> >      42/1    0.010    0.000   25.290   25.290 <string>:1(?)
> >
> > but, to tell the truth, I can't really see where exactly the time is
> > being consumed. Perhaps somebody with more experience can shed more
> > light on this?
> >
> > Another thing that I find intriguing has to do with Numeric and
> > the oprofile output. To recap:
> >
> > samples  %        image name          symbol name
> > 279      15.6478  libpthread-0.10.so  __pthread_alt_unlock
> > 216      12.1144  libc-2.3.2.so       memmove
> > 187      10.4879  python              lookdict_string
> > 162      9.0858   python              PyEval_EvalFrame
> > 144      8.0763   libpthread-0.10.so  __pthread_alt_lock
> > 126      7.0667   libpthread-0.10.so  __pthread_alt_trylock
> > 56       3.1408   python              PyDict_SetItem
> > 53       2.9725   libpthread-0.10.so  __GI___pthread_mutex_unlock
> > 45       2.5238   _numpy.so           PyArray_FromDimsAndDataAndDescr
> > 39       2.1873   libc-2.3.2.so       __malloc
> > 36       2.0191   libc-2.3.2.so       __cfree
> >
> > we can see that a lot of the time in the benchmark using Numeric is
> > consumed in libc space (37% or so). However, only 16% is used in
> > memory-related tasks (memmove, malloc and free), while the rest seems
> > to be spent on thread issues (??). Again, can anyone explain why the
> > pthread* routines take so much time, or why they appear here at all?
> > Perhaps getting rid of these calls might improve Numeric's
> > performance even further.
> >
> > Cheers,
>
