[SciPy-user] Dealing with Large Data Sets

Damian Eads eads@soe.ucsc....
Sat May 10 15:22:38 CDT 2008


Hi Lex,

lechtlr wrote:
> I try to create an array called 'results' as provided in the example
> below.  Is there a way to do this operation more efficiently when the
> number of 'data_x' arrays gets larger?  Also, I am looking for pointers
> on eliminating the intermediate 'data_x' arrays while creating 'results'
> in the following procedure.

If you know a priori how many "data_x" arrays you need, you should 
allocate a 3-dimensional array.

data_x = numpy.zeros((narrays, nrows, ncolumns))   # narrays = number of 'data_x' arrays
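
For example, here is a minimal sketch of filling such an array slice by 
slice; compute_block is just a hypothetical stand-in for whatever numerical 
operation produces one nrows by ncolumns block:

    import numpy as np

    narrays, nrows, ncolumns = 3, 5, 10           # sizes from your example
    data_x = np.zeros((narrays, nrows, ncolumns))

    def compute_block(i):
        # hypothetical stand-in for your real per-array computation
        return np.random.rand(nrows, ncolumns)

    for i in range(narrays):
        data_x[i] = compute_block(i)              # fill one 2-D slice in place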

I advise against using for loops over the elements of an array when 
data_x is large. Many operations in numpy and scipy have been carefully 
designed to work over whole arrays at once, so developing an intuition for 
how to vectorize your code will be very helpful. The numpy documentation 
on array slicing and vectorization is extensive; Travis Oliphant's Guide 
to Numpy is an excellent reference on these topics.
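
As a small illustration (not taken from your code), here is a column sum 
written first with explicit Python loops and then vectorized:

    import numpy as np

    a = np.random.rand(1000, 50)

    # explicit Python loops: slow once the array gets large
    col_sums_loop = np.zeros(a.shape[1])
    for j in range(a.shape[1]):
        for i in range(a.shape[0]):
            col_sums_loop[j] += a[i, j]

    # vectorized: the same loop runs in compiled code inside numpy
    col_sums_vec = a.sum(axis=0)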

> from numpy import *
> from numpy.random import *

This is bad practice, so please avoid it. Either import just the functions 
you need or import the packages using shorter names, e.g.

    import numpy as np
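
or, importing only the names your example actually uses:

    from numpy import zeros, dot
    from numpy.random import rand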

> 
> # what is the best way to create an array named 'results' below
> # when number of 'data_x' (i.e., x = 1, 2.....1000) is large.
> # Also nrows and ncolumns can go upto 10000
> 
> nrows = 5
> ncolumns = 10
> 
> data_1 = zeros([nrows, ncolumns], 'd')
> data_2 = zeros([nrows, ncolumns], 'd')
> data_3 = zeros([nrows, ncolumns], 'd')
> 
> # to store squared sum of each column from the arrays above
> results = zeros([3,ncolumns], 'd')
> 
> # loop to store raw data from a numerical operation;
> # rand() is given as an example here
> for i in range(nrows):
>     for j in range(ncolumns):
>         data_1[i,j] = rand()
>         data_2[i,j] = rand()
>         data_3[i,j] = rand()

numpy.random.rand(m, n) generates an m by n array of doubles drawn 
uniformly from [0, 1), while numpy.random.rand(q, m, n) generates a 
q by m by n array. The rand function accepts any number of integer 
arguments, so it can generate a random array of arbitrary dimension.
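
For instance:

    import numpy as np

    a = np.random.rand(5, 10)       # a.shape == (5, 10), values in [0, 1)
    b = np.random.rand(3, 5, 10)    # b.shape == (3, 5, 10)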

> # store squared sum of each column from data_x
> for k in range(ncolumns):
>     results[0,k] = dot(data_1[:,k], data_1[:,k])
>     results[1,k] = dot(data_2[:,k], data_2[:,k])
>     results[2,k] = dot(data_3[:,k], data_3[:,k])
> 
> print results

The code above can be reduced to

    import numpy as np
    data = np.random.rand(3, 5, 10)
    results = (data ** 2).sum(axis=1)

which first generates a 3 by 5 by 10 array of random values. (data ** 2) 
squares each value in the array, then .sum(axis=1) sums down each column 
(that is, over the rows), producing a 3 by 10 array object referenced by 
'results', the same shape as the 'results' in your code.
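
If you want to convince yourself that this matches your original 
dot-product loop, a quick check along these lines should do it:

    import numpy as np

    data = np.random.rand(3, 5, 10)
    results = (data ** 2).sum(axis=1)        # shape (3, 10)

    # the same quantity computed with your original dot-product loop
    check = np.zeros((3, 10))
    for x in range(3):
        for k in range(10):
            check[x, k] = np.dot(data[x, :, k], data[x, :, k])

    assert np.allclose(results, check)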

Since you mentioned the possibility that you may be dealing with large 
data sets: in these situations, in-place vectorization can be very helpful. 
The code below makes use of the <operator>= operators

   data **= 2.0
   data.sum(axis=1)

The first line squares every element of data in place, without allocating 
a temporary copy; data.sum(axis=1) still allocates a new array for its 
result. If that result is itself large, preallocate an array to store the 
sum,

   # one row per data set, one entry per column
   sum_result = numpy.zeros((data.shape[0], data.shape[2]))
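
and have sum write its result directly into that buffer (ndarray.sum 
accepts an out argument). A minimal sketch of the whole in-place version:

   import numpy as np

   data = np.random.rand(3, 5, 10)
   sum_result = np.zeros((data.shape[0], data.shape[2]))

   data **= 2.0                        # square every element in place
   data.sum(axis=1, out=sum_result)    # column sums written into sum_result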

There are quite a few of these in-place operators in Python, and 
numpy.ndarray defines them. Try typing help(numpy.ndarray) for a full 
listing.
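
A few of the common ones, just to illustrate:

   import numpy

   a = numpy.random.rand(5, 10)
   a += 1.0      # in-place add       (ndarray.__iadd__)
   a *= 2.0      # in-place multiply  (ndarray.__imul__)
   a **= 0.5     # in-place power     (ndarray.__ipow__)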

Of course, it really depends on what your definition of large is. I 
frequently work with gigabyte+ data sets, so to me that's large, and tools 
such as in-place vectorization, weave, mmap, and C extensions are 
essential. However, to others, any data set too large for a human to make 
sense of through cursory visual inspection is large. If that is what you 
mean by large, you may not see appreciable gains from the vectorization 
approaches I mentioned.

Below are links describing array slicing and manipulation in detail (the 
first two are free),

   http://www.scipy.org/Tentative_NumPy_Tutorial
   http://www.scipy.org/NumPy_for_Matlab_Users
   http://www.tramy.us/   (Guide to Numpy)

I hope this helps.

Cheers,

Damian

