[Numpy-discussion] Speeding up Numeric

Francesc Altet faltet at carabos.com
Fri Jan 28 14:30:54 CST 2005


Hi Todd,

Nice to see that you achieved a good speed-up with your
optimization patch. With the following code:

import numarray
a = numarray.arange(2000)
a.shape = (1000, 2)
for j in xrange(1000):
    for i in range(len(a)):
        row = a[i]

and the original numarray-1.1.1, it took 11.254s (Pentium4 @ 2GHz). With your
patch, this time has been reduced to 7.816s. Now, following your
suggestion to push NumArray.__del__ down into C, I've got a good
speed-up as well: 5.332s. This is more than twice as fast as the
unpatched numarray 1.1.1. There is still a long way to go until we can
catch Numeric (1.123s), but it is a first step :)
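The gain from pushing __del__ into C is consistent with the per-object cost of a Python-level destructor: every array destruction triggers a Python method call. Here is a minimal sketch, using plain Python classes instead of numarray, that measures just that overhead:

```python
import timeit

class Plain:
    """No __del__: destruction stays entirely at the C level."""
    pass

class WithDel:
    """A Python-level __del__ loosely mimicking numarray's shadow check."""
    def __del__(self):
        if getattr(self, "_shadows", None) is not None:
            self._shadows = None

# Each call allocates one object and immediately destroys it,
# so we are timing the create/destroy cycle in isolation.
n = 200_000
t_plain = timeit.timeit(Plain, number=n)
t_del = timeit.timeit(WithDel, number=n)
print("plain: %.3fs  with __del__: %.3fs" % (t_plain, t_del))
```

On my machine the class with a Python __del__ is consistently the slower of the two, which is exactly the per-destruction cost the tp_dealloc patch removes.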

Here is the patch. Please review it, as I'm not very used to dealing
with pure C extensions (I'm just a Pyrex user):

Index: Lib/numarraycore.py
===================================================================
RCS file: /cvsroot/numpy/numarray/Lib/numarraycore.py,v
retrieving revision 1.101
diff -r1.101 numarraycore.py
696,699c696,699
<     def __del__(self):
<         if self._shadows != None:
<             self._shadows._copyFrom(self)
<             self._shadows = None
---
>       def __del__(self):
>           if self._shadows != None:
>               self._shadows._copyFrom(self)
>               self._shadows = None
Index: Src/_numarraymodule.c
===================================================================
RCS file: /cvsroot/numpy/numarray/Src/_numarraymodule.c,v
retrieving revision 1.65
diff -r1.65 _numarraymodule.c
399a400,411
> static void
> _numarray_dealloc(PyObject *self)
> {
>   PyArrayObject *selfa = (PyArrayObject *) self;
>
>   if (selfa->_shadows != NULL) {
>     _copyFrom(selfa->_shadows, self);
>     selfa->_shadows = NULL;
>   }
>   self->ob_type->tp_free(self);
> }
>
421c433
<       0,                                      /* tp_dealloc */
---
>       _numarray_dealloc,                      /* tp_dealloc */
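
For anyone wondering what this __del__/tp_dealloc pair actually does: as I understand it, an array's _shadows attribute points at the array for which this one is a temporary working copy, and the data must be written back when the copy dies. A toy Python sketch of that copy-back contract, assuming those semantics (class and attribute names here are illustrative, not numarray's real API):

```python
class Shadowed:
    """Toy model of numarray's _shadows copy-back (assumed semantics)."""
    def __init__(self, data, shadows=None):
        self.data = list(data)
        self._shadows = shadows   # the array this one is a working copy of

    def __del__(self):
        # On destruction, write our data back into the shadowed array,
        # analogous to self._shadows._copyFrom(self) in the real code.
        if self._shadows is not None:
            self._shadows.data[:] = self.data
            self._shadows = None

original = Shadowed([1, 2, 3])
working = Shadowed(original.data, shadows=original)
working.data[0] = 99
del working                 # refcount hits zero: copy-back runs here
print(original.data)        # prints [99, 2, 3]
```

The C version in the patch does the same check in tp_dealloc, so the common case (no shadow) never leaves C at all.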


The profile with the new optimizations looks now like:

samples  %        image name               symbol name
453       8.6319  python                   PyEval_EvalFrame
372       7.0884  python                   lookdict_string
349       6.6502  python                   string_hash
271       5.1639  libc-2.3.2.so            _wordcopy_bwd_aligned
210       4.0015  libnumarray.so           NA_updateStatus
194       3.6966  python                   _PyString_Eq
185       3.5252  libc-2.3.2.so            __GI___strcasecmp
162       3.0869  python                   subtype_dealloc
158       3.0107  libc-2.3.2.so            _int_malloc
147       2.8011  libnumarray.so           isBufferWriteable
145       2.7630  python                   PyDict_SetItem
135       2.5724  _ndarray.so              _view
131       2.4962  python                   PyObject_GenericGetAttr
122       2.3247  python                   PyDict_GetItem
100       1.9055  python                   PyString_InternInPlace
94        1.7912  libnumarray.so           getReadBufferDataPtr
77        1.4672  _ndarray.so              _simpleIndexingCore

i.e., time spent in libc and libnumarray is moving up the list, as it
should. Now we have to concentrate on other points of optimization.
Perhaps it is a good time to try recompiling the kernel and
getting the call tree...

Cheers,

On Friday 28 January 2005 12:48, Todd Miller wrote:
> I got some insight into what I think is the tall pole in the profile:
> sub-array creation is implemented using views.  The generic indexing
> code does a view() Python callback because object arrays override
> view().  Faster view() creation for numerical arrays can be achieved
> like this by avoiding the callback:
>
> Index: Src/_ndarraymodule.c
> ===================================================================
> RCS file: /cvsroot/numpy/numarray/Src/_ndarraymodule.c,v
> retrieving revision 1.75
> diff -c -r1.75 _ndarraymodule.c
> *** Src/_ndarraymodule.c        14 Jan 2005 14:13:22 -0000      1.75
> --- Src/_ndarraymodule.c        28 Jan 2005 11:15:50 -0000
> ***************
> *** 453,460 ****
>                 }
>         } else {  /* partially subscripted --> subarray */
>                 long i;
> !               result = (PyArrayObject *)
> !                       PyObject_CallMethod((PyObject *)self,"view",NULL);
>                 if (!result) goto _exit;
>
>                 result->nd = result->nstrides = self->nd - nindices;
> --- 453,463 ----
>                 }
>         } else {  /* partially subscripted --> subarray */
>                 long i;
> !               if (NA_NumArrayCheck((PyObject *)self))
> !                       result = _view(self);
> !               else
> !                       result = (PyArrayObject *) PyObject_CallMethod(
> !                               (PyObject *) self,"view",NULL);
>                 if (!result) goto _exit;
>
>                 result->nd = result->nstrides = self->nd - nindices;
>
> I committed the patch above to CVS for now.  This optimization makes
> view() "non-overridable" for NumArray subclasses so there is probably a
> better way of doing this.
>
> One other thing that struck me looking at your profile,  and it has been
> discussed before,  is that NumArray.__del__() needs to be pushed (back)
> down into C.   Getting rid of __del__ would also synergize well with
> making an object freelist,  one aspect of which is capturing unneeded
> objects rather than destroying them.
>
> Thanks for the profile.
>
> Regards,
> Todd
>
> On Thu, 2005-01-27 at 21:36 +0100, Francesc Altet wrote:
> > Hi,
> >
> > After a while of waiting for some free time, I'm playing with
> > the excellent oprofile, trying to help reduce numarray creation time.
> >
> > For that goal, I selected the following small benchmark:
> >
> > import numarray
> > a = numarray.arange(2000)
> > a.shape=(1000,2)
> > for j in xrange(1000):
> >     for i in range(len(a)):
> >         row=a[i]
> >
> > I know that it mixes creation with indexing cost, but as the indexing
> > cost of numarray is only a bit slower (perhaps 40%) than Numeric's,
> > while array creation time is 5 to 10 times slower, I think this
> > benchmark may provide a good starting point to see what's going on.
> >
> > For numarray, I've got the next results:
> >
> > samples  %        image name               symbol name
> > 902       7.3238  python                   PyEval_EvalFrame
> > 835       6.7798  python                   lookdict_string
> > 408       3.3128  python                   PyObject_GenericGetAttr
> > 384       3.1179  python                   PyDict_GetItem
> > 383       3.1098  libc-2.3.2.so            memcpy
> > 358       2.9068  libpthread-0.10.so       __pthread_alt_unlock
> > 293       2.3790  python                   _PyString_Eq
> > 273       2.2166  libnumarray.so           NA_updateStatus
> > 273       2.2166  python                   PyType_IsSubtype
> > 271       2.2004  python                   countformat
> > 252       2.0461  libc-2.3.2.so            memset
> > 249       2.0218  python                   string_hash
> > 248       2.0136  _ndarray.so              _universalIndexing
> >
> > while for Numeric I've got this:
> >
> > samples  %        image name               symbol name
> > 279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
> > 216      12.1144  libc-2.3.2.so            memmove
> > 187      10.4879  python                   lookdict_string
> > 162       9.0858  python                   PyEval_EvalFrame
> > 144       8.0763  libpthread-0.10.so       __pthread_alt_lock
> > 126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
> > 56        3.1408  python                   PyDict_SetItem
> > 53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
> > 45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
> > 39        2.1873  libc-2.3.2.so            __malloc
> > 36        2.0191  libc-2.3.2.so            __cfree
> >
> > One preliminary result is that numarray spends a lot more time in
> > Python space than Numeric does, as Todd already said here. The problem
> > is that, as I have not yet patched my kernel, I can't get the call
> > tree, so I can't find the code ultimately responsible for that.
> >
> > So, I've tried to run the profile module included in the standard
> > library in order to see which are the hot spots in python:
> >
> > $ time ~/python.nobackup/Python-2.4/python -m profile -s time create-numarray.py
> >          1016105 function calls (1016064 primitive calls) in 25.290 CPU seconds
> >
> >    Ordered by: internal time
> >
> >    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
> >         1   19.220   19.220   25.290   25.290 create-numarray.py:1(?)
> >    999999    5.530    0.000    5.530    0.000 numarraycore.py:514(__del__)
> >      1753    0.160    0.000    0.160    0.000 :0(eval)
> >         1    0.060    0.060    0.340    0.340 numarraycore.py:3(?)
> >         1    0.050    0.050    0.390    0.390 generic.py:8(?)
> >         1    0.040    0.040    0.490    0.490 numarrayall.py:1(?)
> >      3455    0.040    0.000    0.040    0.000 :0(len)
> >         1    0.030    0.030    0.190    0.190 ufunc.py:1504(_makeCUFuncDict)
> >        51    0.030    0.001    0.070    0.001 ufunc.py:184(_nIOArgs)
> >      3572    0.030    0.000    0.030    0.000 :0(has_key)
> >      2582    0.020    0.000    0.020    0.000 :0(append)
> >      1000    0.020    0.000    0.020    0.000 :0(range)
> >         1    0.010    0.010    0.010    0.010 generic.py:510(_stridesFromShape)
> >      42/1    0.010    0.000   25.290   25.290 <string>:1(?)
> >
> > but, to tell the truth, I can't really see where the time is
> > consumed. Perhaps somebody with more experience can shed more light
> > on this?
> >
> > Another thing that I find intriguing has to do with the Numeric
> > oprofile output. Let me recall it:
> >
> > samples  %        image name               symbol name
> > 279      15.6478  libpthread-0.10.so       __pthread_alt_unlock
> > 216      12.1144  libc-2.3.2.so            memmove
> > 187      10.4879  python                   lookdict_string
> > 162       9.0858  python                   PyEval_EvalFrame
> > 144       8.0763  libpthread-0.10.so       __pthread_alt_lock
> > 126       7.0667  libpthread-0.10.so       __pthread_alt_trylock
> > 56        3.1408  python                   PyDict_SetItem
> > 53        2.9725  libpthread-0.10.so       __GI___pthread_mutex_unlock
> > 45        2.5238  _numpy.so                PyArray_FromDimsAndDataAndDescr
> > 39        2.1873  libc-2.3.2.so            __malloc
> > 36        2.0191  libc-2.3.2.so            __cfree
> >
> > we can see that a lot of the time in the Numeric benchmark is
> > consumed in libc space (37% or so). However, only 16% is used in
> > memory-related tasks (memmove, malloc and free), while the rest seems
> > to be spent on thread issues (??). Again, can anyone explain why the
> > pthread* routines take so much time, or why they appear here at all?
> > Perhaps getting rid of these calls might improve Numeric's
> > performance even further.
> >
> > Cheers,
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/numpy-discussion

-- 
>qo<   Francesc Altet     http://www.carabos.com/
V  V   Cárabos Coop. V.   Enjoy Data
 ""
