# [Numpy-discussion] newbie question - large dataset

Anne Archibald peridot.faceted@gmail....
Sat Apr 7 13:48:47 CDT 2007

On 07/04/07, Steve Staneff <staneff@constructiondatares.com> wrote:
> Hi,
>
> I'm looking for a better solution to managing a very large calculation.
> Set A is composed of tuples a, each of the form a = [float, string]; set B
> is composed of tuples of similar structure (b = [float, string]).  For
> each possible combination of a and b I'm calculating c, of the form c =
> f(a,b) = [g(a[0], b[0]), h(a[1], b[1])] where g() and h() are non-trivial
> functions.
>
> There are now 15,000 or more tuples in A, and 100,000 or more tuples in B.
>  B is expected to grow with time as the source database grows.  In
> addition, there are many more elements in a and b than I've stated (and
> many more functions operating on them).  I'm currently using python to
> loop through each a in A and each b in B, which takes days.
>
> If anyone can point me to a better approach via numpy ( or anything
> else!), I'd be very appreciative.

It's difficult to tell exactly what's going on from this very abstract
description, but it sounds like the majority of your time is being
spent inside g and h, which are each being called about 1.5 billion
times (15,000 x 100,000). A simple nested Python loop shouldn't, by
itself, take days to run that many times, so if the functions really
dominate, using numpy only to optimize the looping will do you no
good: you will need to find some way of making g and h themselves
faster.
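One way to check where the time actually goes before optimizing anything is to profile the inner loop on a small sample. A minimal sketch with the standard cProfile module; g, h, and the sample data here are placeholders, not your real functions:

```python
import cProfile
import io
import pstats

def g(x, y):
    # placeholder for the real (expensive) numeric function
    return (x - y) ** 2

def h(s, t):
    # placeholder for the real (expensive) string function
    return s + "/" + t

# small samples standing in for A and B (hypothetical data)
A = [(float(i), str(i)) for i in range(300)]
B = [(float(j), str(j)) for j in range(300)]

def run():
    # the nested loop over all (a, b) combinations, as described
    return [(g(a[0], b[0]), h(a[1], b[1])) for a in A for b in B]

profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# show the five most expensive calls, sorted by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If g and h dominate the report even at this scale, that tells you the loop machinery isn't the problem.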

First priority, of course, should be to reduce the amount of code that
gets run a billion and a half times. If some part of g or h depends
only on a or only on b, hoisting that calculation outside the inner
loop will be a huge win. If many pairs (a, b) are irrelevant, zero, or
negligible, skipping them entirely may accelerate things dramatically.
And if there are only a modest number of distinct strings or
floating-point numbers, you may be able to cache function values in a
dictionary.
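The dictionary-caching idea can be sketched with the standard library's functools.lru_cache, which memoizes a function in exactly this way; the function body below is a stand-in, not your real g:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def g(x, y):
    # stand-in for the real, expensive numeric function;
    # results are cached keyed on the (x, y) argument pair
    return (x - y) ** 2

# with few distinct inputs, repeated (x, y) pairs become dict lookups
values = [g(1.0, 2.0) for _ in range(1000)]
print(g.cache_info())
```

Of the 1000 calls above, only the first actually executes the function body; the rest are cache hits. This only pays off if the number of distinct argument pairs is small relative to the 1.5 billion calls.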

If none of those algorithmic improvements is possible, you can look at
other ways of speeding things up, though the gains will be more
modest. Parallelism is an obvious one: on a multicore machine you may
be able to cut your processing time by roughly the number of cores
available, with minimal effort (for example by replacing a for loop
with a simple foreach, implemented as in the attached file). You could
also try psyco, a runtime optimizer, though I've never found it
accelerated any code of mine. Moving up the difficulty scale, you
could try numexpr (if your functions are numeric), Pyrex, or weave,
which let you write compiled code for use in Python with a minimum of
pain, or write the functions in C directly.

Finally, I should point out that working with gigantic arrays (c has a
billion and a half elements) in numpy can actually be slower than
processing the elements with list comprehensions (say) in stock
Python, because the element-by-element approach has better data
locality and avoids copying and allocating (giant) intermediate
arrays. In many contexts, numpy should be viewed not as a way to speed
up the execution of your code but as a way to write it more simply and
clearly.