# [Numpy-discussion] [Newbie] Fast plotting

Franck Pommereau pommereau@univ-paris12...
Wed Jan 7 06:37:53 CST 2009

Hi all,

First, let me say that I'm impressed: this mailing list is probably the
most responsive I've ever seen. I asked my first question and immediately
got more solutions than I had time to test... Many thanks to all.

Using the various proposals, I ran two performance tests:
- test 1: 2000000 random values
- test 2: 1328724 values from my real use case
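A minimal harness along these lines can drive such timings (a sketch; the key range and the seed below are arbitrary, not those of the actual tests):

```python
import time
import numpy

# Arbitrary synthetic data in the spirit of test 1: many duplicate keys
# (so grouping matters) and float values to average.
rng = numpy.random.default_rng(0)
x = rng.integers(0, 1000, size=2_000_000)
y = rng.random(2_000_000)

def timed(f, x, y):
    """Time one grouping function on copies (some variants sort in place)."""
    t0 = time.process_time()
    result = f(x.copy(), y.copy())
    return result, time.process_time() - t0
```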

Here are the various functions and how they perform:

```python
import numpy

def f0(x, y):
    """Initial version.

    test 1 CPU times: 13.37s
    test 2 CPU times: 5.92s
    """
    s, n = {}, {}
    for a, b in zip(x, y):
        s[a] = s.get(a, 0.0) + b
        n[a] = n.get(a, 0) + 1
    return (numpy.array([a for a in sorted(s)]),
            numpy.array([s[a] / n[a] for a in sorted(s)]))
```

```python
from collections import defaultdict
from itertools import izip  # Python 2; use the built-in zip on Python 3

def f1(x, y):
    """Alan G Isaac <aisaac@american.edu>
    Modified in order to sort the result only once.

    test 1 CPU times: 10.86s
    test 2 CPU times: 2.78s

    defaultdict indeed speeds things up; avoiding one of the two
    sorts probably helps as well.
    """
    s, n = defaultdict(float), defaultdict(int)
    for a, b in izip(x, y):
        s[a] += b
        n[a] += 1
    new_x = numpy.array(sorted(s))
    return (new_x,
            numpy.array([s[a] / n[a] for a in new_x]))
```

```python
def f2(x, y):
    """Francesc Alted <faltet@pytables.org>
    Modified with preallocation of arrays (it appeared faster).

    test 1: killed after more than 10 minutes
    test 2 CPU times: 22.01s

    This result is not surprising: I suspect quadratic complexity,
    with one pass for each unique value in x and presumably a nested
    pass to compute y[x == i].
    """
    u = numpy.unique(x)
    m = numpy.empty(len(u))  # float array; an int array would truncate the means
    for pos, i in enumerate(u):
        g = y[x == i]
        m[pos] = g.mean()
    return u, m
```

```python
def f3(x, y):
    """Sebastian Stephan Berg <sebastian@sipsolutions.net>
    Modified because I can always work in place.

    test 1 CPU times: 17.43s
    test 2 CPU times: 0.21s

    Adopted! This is definitely the fastest one on real values.
    I tried to preallocate the arrays by setting u = numpy.unique(x)
    and looping on u, but the result was slower, probably because of
    unique().

    Compared with f1, it is slower on large arrays of random values.
    A complexity argument may explain this: f1 is linear (two passes
    in sequence) while f3 is probably O(N log N) (one sort, two passes
    to set x[:] and y[:], and one loop over the distinct values with a
    nested searchsorted, which is logarithmic). But real values are
    far from random: the sort is probably more efficient, and the
    while loop is shorter because there are fewer distinct values.
    """
    s = x.argsort()
    x[:] = x[s]
    y[:] = y[s]
    u, means, start, value = [], [], 0, x[0]
    while True:
        stop = x.searchsorted(value, side='right')
        u.append(value)
        means.append(y[start:stop].mean())
        if stop == len(x):
            break
        value = x[stop]
        start = stop
    return numpy.array(u), numpy.array(means)
```

```python
import matplotlib.mlab

def f4(x, y):
    """Jean-Baptiste Rudant <boogaloojb@yahoo.fr>

    test 1 CPU times: 111.21s
    test 2 CPU times: 13.48s

    As Jean-Baptiste noticed, this solution is not very efficient
    (but works almost off-the-shelf).
    """
    recXY = numpy.rec.fromarrays((x, y), names='x, y')
    return matplotlib.mlab.rec_groupby(recXY, ('x',),
                                       (('y', numpy.mean, 'y_avg'),))
```
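For completeness, none of the variants above is fully vectorized; a sketch of one that is, built on `numpy.unique` with `return_inverse` and `numpy.bincount` (not from the original thread):

```python
import numpy

def grouped_mean(x, y):
    """Mean of y grouped by the values of x, without a Python-level loop.

    unique() returns the sorted distinct keys and, with
    return_inverse=True, the index of each element's key; bincount
    with weights then sums y per key in one pass, and the plain
    bincount gives the per-key counts.
    """
    keys, inv = numpy.unique(x, return_inverse=True)
    sums = numpy.bincount(inv, weights=y)
    counts = numpy.bincount(inv)
    return keys, sums / counts
```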

A few more remarks.

Sebastian Stephan Berg wrote:
> Just thinking. If the parameters are limited, you may be able to use the
> histogram feature? Doing one histogram with Y as weights, then one
> without weights and calculating the mean from this yourself should be
> pretty speedy I imagine.

I'm afraid I don't know what the histogram function computes, but it may
be worth investigating, because I think I'll need it later on in order to
smooth my graphs (by plotting mean values over intervals).
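If I read Sebastian's suggestion right, it amounts to something like this (a sketch, assuming the x values can be bucketed into fixed intervals):

```python
import numpy

def binned_mean(x, y, bins=10):
    """Approximate mean of y over intervals of x, via two histograms.

    The weighted histogram gives the sum of y per bin, the plain one
    the number of samples per bin; their ratio is the per-bin mean.
    """
    sums, edges = numpy.histogram(x, bins=bins, weights=y)
    counts, _ = numpy.histogram(x, bins=bins)
    # Guard against division by zero in empty bins (result is NaN there).
    means = numpy.divide(sums, counts,
                         out=numpy.full_like(sums, numpy.nan),
                         where=counts > 0)
    centers = (edges[:-1] + edges[1:]) / 2
    return centers, means
```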

Bruce Southey wrote:
> If you use Knuth's one pass approach
> (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-line_algorithm)
> you can write a function to get the min, max, mean and variance/standard
> deviation in a single pass through the array rather than one pass for
> each. I do not know if this will provide any advantage as that will
> probably depend on the size of the arrays.

If I understood correctly, this algorithm computes the variance of a
whole array. I can see how to adapt it to compute the mean (already done
by the algorithm), the max, the min, etc., but I do not see how to adapt
it to my case.
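For the record, the one-pass approach Bruce refers to (Welford's method) keeps a running mean and a running sum of squared deviations; a sketch, on a whole array as I understood it:

```python
import math

def running_stats(values):
    """One pass over values, returning (min, max, mean, variance).

    Welford's update: mean += delta / n maintains the running mean,
    and m2 accumulates the sum of squared deviations from it, which
    yields the (population) variance at the end.
    """
    n = 0
    mean = 0.0
    m2 = 0.0
    lo = math.inf
    hi = -math.inf
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        lo = min(lo, v)
        hi = max(hi, v)
    variance = m2 / n if n else float('nan')
    return lo, hi, mean, variance
```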

> Also, please use the highest precision possible (ie float128) for your
> arrays to minimize numerical error due to the size of your arrays.