[SciPy-user] Dealing with Large Data Sets
Sat May 10 15:22:38 CDT 2008
> I try to create an array called 'results' as provided in an example
> below. Is there a way to do this operation more efficiently when the
> number of 'data_x' arrays gets larger? Also, I am looking for pointers
> to eliminate intermediate 'data_x' arrays, while creating 'results' in
> the following procedure.
If you know a priori how many "data_x" arrays you need, you should
allocate a single 3-dimensional array:
data_x = numpy.zeros((narrays, nrows, ncolumns))  # narrays = number of data_x arrays
I advise against using for loops over the elements of an array when
data_x is large. Many operations within numpy and scipy have been
carefully designed to work over large arrays. Developing an intuition on
how to vectorize will be very helpful. The numpy documentation on array
slicing and vectorization is extensive; Travis Oliphant's Guide to Numpy
is an excellent reference on these topics.
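As a toy illustration of the difference (the i + j fill rule here is just
a made-up example, not something from your post):

import numpy as np

nrows, ncolumns = 5, 10

# element-by-element loop version (slow once the array is large)
a = np.zeros((nrows, ncolumns))
for i in range(nrows):
    for j in range(ncolumns):
        a[i, j] = i + j

# vectorized version: broadcasting builds the same array with no Python loops
rows = np.arange(nrows).reshape(-1, 1)   # column vector of row indices
cols = np.arange(ncolumns)               # row vector of column indices
b = rows + cols

print(np.allclose(a, b))                 # True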
> from numpy import *
> from numpy.random import *
Wildcard imports like these are bad practice, so please avoid them. Either
import the functions you need or import the packages under shorter names, e.g.
import numpy as np
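For example, either of these styles keeps the namespace clear (just a
small illustration):

import numpy as np             # short package alias
from numpy.random import rand  # or import only the names you actually use

a = np.zeros((5, 10))
b = rand(5, 10)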
> # what is the best way to create an array named 'results' below
> # when number of 'data_x' (i.e., x = 1, 2.....1000) is large.
> # Also nrows and ncolumns can go upto 10000
> nrows = 5
> ncolumns = 10
> data_1 = zeros([nrows, ncolumns], 'd')
> data_2 = zeros([nrows, ncolumns], 'd')
> data_3 = zeros([nrows, ncolumns], 'd')
> # to store squared sum of each column from the arrays above
> results = zeros([3,ncolumns], 'd')
> # loop to store raw data from a numerical operation;
> # rand() is given as an example here
> for i in range(nrows):
>     for j in range(ncolumns):
>         data_1[i,j] = rand()
>         data_2[i,j] = rand()
>         data_3[i,j] = rand()
numpy.random.rand(m, n) generates an m by n array of doubles drawn from
a U[0, 1] distribution, while numpy.random.rand(q, m, n) generates a q by
m by n array. The rand function accepts any number of integer arguments
and generates a random array of the corresponding dimension.
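For instance:

import numpy as np

x = np.random.rand(4)           # shape (4,)
y = np.random.rand(5, 10)       # shape (5, 10)
z = np.random.rand(3, 5, 10)    # shape (3, 5, 10)
print(x.shape, y.shape, z.shape)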
> # store squared sum of each column from data_x
> for k in range(ncolumns):
>     results[0,k] = dot(data_1[:,k], data_1[:,k])
>     results[1,k] = dot(data_2[:,k], data_2[:,k])
>     results[2,k] = dot(data_3[:,k], data_3[:,k])
> print results
The code above can be reduced to
import numpy as np
data = np.random.rand(3, 5, 10)
results = (data ** 2).sum(axis=1)
np.random.rand(3, 5, 10) generates a 3 by 5 by 10 array of the raw data,
(data ** 2) squares each value in the array, and .sum(axis=1) sums the
squares down each column (i.e., over the rows), producing the 3 by 10
array object referenced by 'results'.
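If you want to convince yourself that the vectorized expression reproduces
your original loops, a quick check along these lines (sizes taken from
your example) works:

import numpy as np

data = np.random.rand(3, 5, 10)
results = (data ** 2).sum(axis=1)        # shape (3, 10)

# the loop version from your post, applied to the same data
check = np.zeros((3, 10))
for x in range(3):
    for k in range(10):
        check[x, k] = np.dot(data[x, :, k], data[x, :, k])

print(np.allclose(results, check))       # True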
Since you mentioned the possibility that you may be dealing with large
data sets: in these situations, in-place vectorization can be very
helpful. The code below makes use of the <operator>= operators
data **= 2.0
results = data.sum(axis=1)
The squaring is performed in place, so no temporary copy of data is
allocated. If the result of data.sum(axis=1) is itself large, preallocate
an array to store the sum,
# for summing the squares down each column
sum_result = np.zeros((data.shape[0], data.shape[2]))
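To actually have the sum written into that preallocated array rather than
into a freshly allocated one, you can pass it via the out= argument of
ndarray.sum; a minimal, self-contained sketch:

import numpy as np

data = np.random.rand(3, 5, 10)
data **= 2.0                                     # square in place, no temporary copy

sum_result = np.zeros((data.shape[0], data.shape[2]))
data.sum(axis=1, out=sum_result)                 # result written into sum_result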
Python provides quite a few of these in-place operators, and numpy.ndarray
defines them all. Try typing help(numpy.ndarray) for a full listing.
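A few of the common ones, as a quick illustration:

import numpy as np

a = np.ones((1000, 1000))
a += 2.0    # in-place add
a *= 0.5    # in-place multiply
a **= 2     # in-place power
a /= 3.0    # in-place divide
# each statement modifies a's buffer directly; no temporary arrays are created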
Of course, it really depends on what your definition of large is. I
frequently work with gigabyte+ data sets, so to me that's large, and tools
such as in-place vectorization, weave, mmap, and C extensions are
essential. However, to others, any data too large for a human to make
sense of through cursory visual inspection is large. If that's what you
mean by large, you may not see appreciable gains from the vectorization
approaches I mentioned.
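For the gigabyte+ case, numpy's memmap is often the simplest of those
tools to start with; below is a minimal sketch (the file name and array
shape are made up for illustration):

import numpy as np

# create a disk-backed array; data is paged in on demand rather than
# loaded into memory all at once ('example.dat' is a made-up file name)
big = np.memmap('example.dat', dtype='float64', mode='w+', shape=(1000, 1000))
big[:] = np.random.rand(1000, 1000)   # write through the memory map
big **= 2.0                           # in-place operations work here too
big.flush()                           # push pending changes to disk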
Below is a link describing array slicing and manipulation in detail:
http://www.tramy.us/ (Guide to Numpy)
I hope this helps.