[SciPy-user] Dealing with Large Data Sets
Anne Archibald
peridot.faceted@gmail....
Sat May 10 14:55:22 CDT 2008
2008/5/10 lechtlr <lechtlr@yahoo.com>:
> I try to create an array called 'results' as provided in an example below.
> Is there a way to do this operation more efficiently when the number of
> 'data_x' array gets larger ? Also, I am looking for pointers to eliminate
> intermediate 'data_x' arrays, while creating 'results' in the following
> procedure.
The rule of thumb is, if you want to do the same thing to many
elements, just create an array of input values, then write the
calculation as if you had a single input value. Most numpy functions
act elementwise.
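As a minimal sketch of that rule (the helper name `f` and the input values are made up for illustration), the same expression can serve a single value and a whole array:

```python
import numpy as np

def f(x):
    # Written as if x were a single number; works unchanged on arrays
    # because np.sqrt and ** both act elementwise.
    return np.sqrt(x) + x**2

single = f(4.0)                       # one input value
many = f(np.array([1.0, 4.0, 9.0]))   # same code applied to every element
```

No explicit loop over the elements is needed; the array version returns one output per input.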
> from numpy import *
> from numpy.random import *
>
> # what is the best way to create an array named 'results' below
> # when number of 'data_x' (i.e., x = 1, 2.....1000) is large.
> # Also nrows and ncolumns can go upto 10000
>
> nrows = 5
> ncolumns = 10
>
> data_1 = zeros([nrows, ncolumns], 'd')
> data_2 = zeros([nrows, ncolumns], 'd')
> data_3 = zeros([nrows, ncolumns], 'd')
>
> # to store squared sum of each column from the arrays above
> results = zeros([3,ncolumns], 'd')
>
> # loop to store raw data from a numerical operation;
> # rand() is given as an example here
> for i in range(nrows):
>     for j in range(ncolumns):
>         data_1[i,j] = rand()
>         data_2[i,j] = rand()
>         data_3[i,j] = rand()
>
> # store squared sum of each column from data_x
> for k in range(ncolumns):
>     results[0,k] = dot(data_1[:,k], data_1[:,k])
>     results[1,k] = dot(data_2[:,k], data_2[:,k])
>     results[2,k] = dot(data_3[:,k], data_3[:,k])
>
> print results
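For your example, keep all the 'data_x' arrays in a single
three-dimensional array, with ndata the number of 'data_x' arrays:
import numpy as np
data = np.random.rand(ndata, nrows, ncolumns)
results = (data**2).sum(axis=1)
or even
results = (np.random.rand(ndata, nrows, ncolumns)**2).sum(axis=1)
Summing over axis=1 (the rows) leaves one squared sum per column for
each 'data_x' array, so results has shape (ndata, ncolumns), matching
your loop. That last operation, which I have written as
(data**2).sum(axis=1), is kind of an embarrassment; dot() or its cousin
tensordot() would be more efficient, but they don't have a suitable
"elementwise" implementation. Nevertheless, squaring and then summing
gives the right answer.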
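A small sketch to check the squared-sum-per-column calculation against the loops in the quoted code. The sizes and the fixed seed are illustrative assumptions, and `np.einsum` is not part of the original suggestion: it was added in NumPy releases after this message, and it computes the same sums without building the squared intermediate array.

```python
import numpy as np

ndata, nrows, ncolumns = 3, 5, 10    # illustrative sizes
rng = np.random.RandomState(0)        # fixed seed so the check is repeatable
data = rng.rand(ndata, nrows, ncolumns)

# Loop version, following the quoted code: one dot product per column.
expected = np.zeros((ndata, ncolumns))
for n in range(ndata):
    for k in range(ncolumns):
        expected[n, k] = np.dot(data[n, :, k], data[n, :, k])

# Vectorized version: square elementwise, then sum over the row axis.
results = (data**2).sum(axis=1)

# einsum version: multiply matching elements and sum over the row index j.
results_einsum = np.einsum('njk,njk->nk', data, data)
```

Both vectorized forms give an array of shape (ndata, ncolumns), one row per 'data_x' array, which is the layout of 'results' in the quoted code.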
Anne