[Numpy-discussion] Are masked arrays slower for processing than ndarrays?

Eric Firing efiring@hawaii....
Sat May 9 17:01:31 CDT 2009


Eli Bressert wrote:
> Hi,
> 
> I'm using masked arrays to compute large-scale standard deviation,
> multiplication, gaussian, and weighted averages. At first I thought
> using the masked arrays would be a great way to sidestep looping
> (which it is), but it's still slower than expected. Here's a snippet
> of the code that I'm using it for.
> 
> # Computing nearest neighbor distances.
> # Output will be about 270,000 rows long for the index
> # and 270,000x50 for the dist array.
> tree = ann.kd_tree(np.column_stack([l,b]))
> index, dist = tree.search(np.column_stack([l,b]),k=nth)
> 
> # Clipping bad values by replacing them with acceptable values
> av[np.where(av<=-10)] = -10
> av[np.where(av>=50)] = 50
> 
> # Distance clipping and creating mask
> dist_arcsec = np.sqrt(dist)*3600
> mask = dist_arcsec <= d_thresh
> 
> # Creating masked array
> av_good = ma.array(av[index],mask=mask)
> dist_good = ma.array(dist_arcsec,mask=mask)
> 
> # Reason why I'm using masked arrays. If these were
> # ndarrays with nan's, then the output would be nan.
> Std = np.array(np.std(av_good,axis=1))
> Var = Std*Std
> 
> Rho = np.zeros( (len(av), nth) )
> Rho2  = np.zeros( (len(av), nth) )
> 
> dist_std = np.std(dist_good,axis=1)
> 
> for j in range(nth):
>     Rho[:,j] = dist_std
>     Rho2[:,j] = Var
> 
> # This part takes about 20 seconds to compute for a 270,000x50 masked array.
> # Using ndarrays of the same size takes about 2 seconds.
> spatial_weight = 1.0 / (Rho*np.sqrt(2*np.pi)) * np.exp( - dist_good /
> (2*Rho**2))
> 
> # Like the spatial_weight section, this takes about 20 seconds
> W = spatial_weight / Rho2

The short answer to your subject line is "yes".  A simple illustration 
of division:

In [11]:x = np.ones((270000,50), float)

In [12]:y = np.ones((270000,50), float)

In [13]:timeit x/y
10 loops, best of 3: 199 ms per loop

In [14]:x = np.ma.ones((270000,50), float)

In [15]:y = np.ma.ones((270000,50), float)

In [16]:x[1,1] = np.ma.masked

In [17]:y[1,2] = np.ma.masked

In [18]:timeit x/y
10 loops, best of 3: 2.45 s per loop

So it is slower by more than a factor of 10.  That's much worse than I 
expected for division (and multiplication is similar).  It makes me 
suspect there might be a simple way to improve it greatly, but I 
haven't looked.
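
One possible workaround, sketched here and not benchmarked: do the arithmetic on the plain ndarrays underneath and reattach a combined mask afterwards. The helper name masked_divide is hypothetical, and the smaller array shape is chosen only to keep the example quick. Masking zero denominators is an assumption meant to mimic what masked division does.

```python
import numpy as np

def masked_divide(x, y):
    """Divide two masked arrays using the plain ndarrays underneath.

    Hypothetical helper, not part of numpy.ma.
    """
    # getmaskarray always returns a full boolean array, even for nomask.
    mask = np.ma.getmaskarray(x) | np.ma.getmaskarray(y)
    xd = np.ma.getdata(x)
    yd = np.ma.getdata(y)
    # Assumption: treat zero denominators as masked, as masked division does.
    mask = mask | (yd == 0)
    # Put a harmless 1.0 wherever the result will be masked anyway.
    safe_y = np.where(mask, 1.0, yd)
    return np.ma.array(xd / safe_y, mask=mask)

x = np.ma.ones((1000, 50), float)
y = np.ma.ones((1000, 50), float)
x[1, 1] = np.ma.masked
y[1, 2] = np.ma.masked

fast = masked_divide(x, y)   # ndarray arithmetic plus explicit mask
slow = x / y                 # ordinary masked-array division
```

Timing fast against slow with timeit at the full 270000x50 size would show whether this helps on a given numpy version.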


> 
> # Takes less than one second.
> Ave = np.average(av_good,axis=1,weights=W)
> 
> Any ideas on why it would take such a long time for processing?
> Especially the spatial_weight and W variables? Would there be a faster
> way to do this? Or is there a way that numpy.std can ignore
> nan's when processing?

There is a numpy.nansum; and see the following thread:
http://www.mail-archive.com/numpy-discussion@scipy.org/msg09407.html
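
As a rough sketch of how numpy.nansum could stand in for a NaN-ignoring standard deviation: count the valid entries per row, then build the mean and variance from nansum. The function name nan_std and the small test array are made up for illustration. This is the population form (ddof=0), and a row that is all NaN would come out as nan.

```python
import numpy as np

def nan_std(a, axis=1):
    """Standard deviation along an axis, skipping NaNs (ddof=0).

    Hypothetical helper built only on np.nansum.
    """
    a = np.asarray(a, dtype=float)
    # Number of non-NaN entries along the axis.
    n = (~np.isnan(a)).sum(axis=axis)
    mean = np.nansum(a, axis=axis) / n
    # Deviations stay NaN where the input was NaN, so nansum skips them.
    dev = a - np.expand_dims(mean, axis)
    var = np.nansum(dev ** 2, axis=axis) / n
    return np.sqrt(var)

av = np.array([[1.0, 2.0, np.nan, 3.0],
               [4.0, np.nan, np.nan, 8.0]])

std_nan = nan_std(av, axis=1)
# Same computation with a masked array, for comparison.
std_ma = np.ma.std(np.ma.masked_invalid(av), axis=1)
```

The two results should agree, which gives a way to check the NaN-based route against the masked-array one before switching.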

Eric

> 
> Thanks,
> 
> Eli Bressert
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion


