[Numpy-discussion] numpy large arrays?
Timothy Hochberg
tim.hochberg@ieee....
Wed Dec 12 13:40:24 CST 2007
On Dec 12, 2007 7:29 AM, Søren Dyrsting <sorendyrsting@gmail.com> wrote:
> Hi all
>
> I need to perform computations involving large arrays: a lot of rows and
> no more than, e.g., 34 columns. My first choice is Python/NumPy because I'm
> already used to coding in Matlab.
>
> However I'm experiencing memory problems even though there is still 500 MB
> available (2 GB total). I have boiled my code down to the following
> meaningless snippet. It shares some of the same structure and calls as my
> real program and shows the same behaviour.
>
> ********************************************************
> import numpy as N
> import scipy as S
>
> def stress():
>     x = S.randn(200000, 80)
>     for i in range(8):
>         print "%(0)d" % {"0": i}
>         s = N.dot(x.T, x)
>         sd = N.array([s.diagonal()])
>         r = N.dot(N.ones((N.size(x, 0), 1), 'd'), sd)
>         x = x + r
>         x = x / 1.01
>
> ********************************************************
>
>
> Two different symptoms, depending on how big x is:
> 1) the program becomes extremely slow after a few iterations.
This appears to be because you are overflowing your floating-point
variables. Once your data has INFs in it, it will tend to run much slower.
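For what it's worth, the blow-up is easy to reproduce at a smaller scale (the array sizes below are arbitrary, not from the original post): each pass adds the per-column sum of squares back into x, so the magnitudes roughly cube every iteration and quickly exceed the float64 maximum (about 1.8e308). A finiteness check pinpoints where it happens:

```python
import numpy as np

x = np.random.randn(1000, 80)
overflowed = False
for i in range(8):
    # same update as the posted loop: add each column's sum of
    # squares to every row, then scale down slightly
    x += (x**2).sum(axis=0)
    x /= 1.01
    if not np.isfinite(x).all():
        overflowed = True
        print("non-finite values appeared at iteration", i)
        break
```

Once INFs appear, NaNs follow (from expressions like inf - inf), and subsequent arithmetic on the array can be markedly slower on some hardware.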
>
> 2) if the size of x is increased a little, the program fails with the
> message "MemoryError", for example at the line 'x = x + r', but at
> different places in the code depending on the matrix size and which
> computer I'm testing on. This might also occur after several iterations,
> not just during the first pass.
Why it would occur after several iterations I'm not sure. It's possible that
there are some cycles that take a while for the garbage collector to get
to, and in the meantime you are chewing through all of your memory. There
are a couple of different things you could try to address that, but before
you do, you need to clean up your algorithm and rewrite it in idiomatic
numpy. I realize that you said the above code is meaningless, but I'm going
to assume that it's indicative of how your numpy code is written. It can be
rewritten as:
def stress2(x):
    for i in range(8):
        print i
        x += (x**2).sum(axis=0)
        x /= 1.01
    return x.sum()
Not only is the above about sixty times faster, it's considerably clearer as
well. FWIW, on my box, which has a very similar setup to yours, neither
version throws a memory error.
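In case the rewrite looks too different from the original, the two identities it relies on are easy to verify on a small array (the 5x3 shape here is just for illustration):

```python
import numpy as np

x = np.random.randn(5, 3)

# 1) the diagonal of dot(x.T, x) is just each column's sum of squares
s = np.dot(x.T, x)
assert np.allclose(s.diagonal(), (x**2).sum(axis=0))

# 2) dot(ones((n,1)), sd) builds an n-by-3 temporary whose rows all
#    equal sd; broadcasting adds the same row without materializing it
sd = np.array([s.diagonal()])           # shape (1, 3), as in the post
r = np.dot(np.ones((5, 1), 'd'), sd)    # shape (5, 3) temporary
assert np.allclose(x + r, x + s.diagonal())
```

So the explicit outer product with a column of ones is doing nothing that NumPy's broadcasting rules don't already do for free.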
>
> I'm using Windows XP, ActivePython 2.5.1.1, NumPy 1.0.4, SciPy 0.6.0.
>
> - Is there an error under the hood in NumPy?
Probably not in this case.
>
> - Am I balancing on the edge of Python/NumPy's performance and should I
> consider other environments: Fortran, C, BLAS, LAPACK, etc.?
Maybe, but try cleaning things up first.
>
> - Am I misusing NumPy? Would changing my coding style be a good workaround
> and even let it handle larger datasets without errors?
Your code is doing a lot of extra work and creating a lot of temporaries.
I'd clean it up before I did anything else.
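Concretely, `x = x + r` allocates a fresh 200000x80 float64 array (about 128 MB) every iteration, on top of the temporaries for `s`, `sd`, and `r`, while the in-place forms reuse the existing buffer. A small sketch of the difference (array sizes here are just illustrative):

```python
import numpy as np

x = np.random.randn(4, 3)
buf = id(x)
x = x / 1.01   # binds x to a brand-new array; the old one lingers
               # until the garbage collector reclaims it
assert id(x) != buf

y = np.random.randn(4, 3)
buf = id(y)
y /= 1.01      # modifies y's existing buffer; no temporary
y += 1.0       # ditto
assert id(y) == buf
```

With several 128 MB allocations per pass, it doesn't take many iterations of lagging deallocation to exhaust a 2 GB machine, which would explain a MemoryError appearing only after a few passes.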
>
>
> Thanks in advance
> /Søren
>
>
> _______________________________________________
> Numpy-discussion mailing list
> Numpy-discussion@scipy.org
> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>
>
--
. __
. |-\
.
. tim.hochberg@ieee.org