[Numpy-discussion] [Newbie] Fast plotting
Franck Pommereau
pommereau@univ-paris12...
Wed Jan 7 06:37:53 CST 2009
Hi all,
First, let me say that I'm impressed: this mailing list is probably the
most responsive I've ever seen. I asked my first question and
immediately got more solutions than I had time to test... Many thanks
to everyone who answered.
Using the various proposals, I ran two performance tests:
- test 1: 2000000 random values
- test 2: 1328724 values from my real use case
Here are the various functions and how they perform:
import numpy
import matplotlib.mlab
from collections import defaultdict
from itertools import izip

def f0(x, y):
    """Initial version.
    test 1 CPU times: 13.37s
    test 2 CPU times: 5.92s
    """
    s, n = {}, {}
    for a, b in zip(x, y):
        s[a] = s.get(a, 0.0) + b
        n[a] = n.get(a, 0) + 1
    return (numpy.array([a for a in sorted(s)]),
            numpy.array([s[a] / n[a] for a in sorted(s)]))
def f1(x, y):
    """Alan G Isaac <aisaac@american.edu>
    Modified in order to sort the result only once.
    test 1 CPU times: 10.86s
    test 2 CPU times: 2.78s
    defaultdict indeed speeds things up; avoiding one of the two
    sorts probably helps as well.
    """
    s, n = defaultdict(float), defaultdict(int)
    for a, b in izip(x, y):
        s[a] += b
        n[a] += 1
    new_x = numpy.array([a for a in sorted(s)])
    return (new_x,
            numpy.array([s[a] / n[a] for a in new_x]))
def f2(x, y):
    """Francesc Alted <faltet@pytables.org>
    Modified with preallocation of the result array (it appeared faster).
    test 1: killed after more than 10 minutes
    test 2 CPU times: 22.01s
    This result is not surprising, as I suspect quadratic complexity:
    one pass for each unique value in x, and presumably one nested
    pass to compute y[x == i].
    """
    u = numpy.unique(x)
    m = numpy.empty(len(u))  # preallocate as float so means are not truncated
    for pos, i in enumerate(u):
        g = y[x == i]
        m[pos] = g.mean()
    return u, m
def f3(x, y):
    """Sebastian Stephan Berg <sebastian@sipsolutions.net>
    Modified because I can always work in place.
    test 1 CPU times: 17.43s
    test 2 CPU times: 0.21s
    Adopted! This is definitely the fastest one when using real values.
    I tried to preallocate the arrays by setting u = numpy.unique(x)
    and then looping on u, but the result is slower, probably because
    of unique(). Compared with f1, it's slower on large arrays of
    random values. This may be explained by a complexity argument: f1
    has linear complexity (two passes in sequence) while f3 is
    probably N log N (one sort, two passes to set x[:] and y[:], and
    one loop over each distinct value with a nested searchsorted that
    is probably logarithmic). But real values are far from random: the
    sort is probably more efficient, and the while loop is shorter
    because there are fewer distinct values.
    """
    s = x.argsort()
    x[:] = x[s]
    y[:] = y[s]
    u, means, start, value = [], [], 0, x[0]
    while True:
        next = x.searchsorted(value, side='right')
        u.append(value)
        means.append(y[start:next].mean())
        if next == len(x):
            break
        value = x[next]
        start = next
    return numpy.array(u), numpy.array(means)
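For completeness, the same grouped mean can also be computed fully vectorized; the sketch below (the name f5 and the toy data are mine, not from the thread) uses numpy.unique with return_inverse plus numpy.bincount:

```python
import numpy

def f5(x, y):
    # inv[i] is the index of x[i] within the sorted unique values, so
    # bincount with y as weights sums y per group in one vectorized pass.
    u, inv = numpy.unique(x, return_inverse=True)
    sums = numpy.bincount(inv, weights=y)
    counts = numpy.bincount(inv)
    return u, sums / counts

u, m = f5(numpy.array([3, 1, 3, 2, 1]),
          numpy.array([6.0, 2.0, 4.0, 10.0, 4.0]))
# u == [1, 2, 3], m == [3.0, 10.0, 5.0]
```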
def f4(x, y):
    """Jean-Baptiste Rudant <boogaloojb@yahoo.fr>
    test 1 CPU times: 111.21s
    test 2 CPU times: 13.48s
    As Jean-Baptiste noticed, this solution is not very efficient (but
    works almost off-the-shelf).
    """
    recXY = numpy.rec.fromarrays((x, y), names='x, y')  # was (x, x): a typo
    return matplotlib.mlab.rec_groupby(recXY, ('x',),
                                       (('y', numpy.mean, 'y_avg'),))
A few more remarks.
Sebastian Stephan Berg wrote:
> Just thinking. If the parameters are limited, you may be able to use the
> histogram feature? Doing one histogram with Y as weights, then one
> without weights and calculating the mean from this yourself should be
> pretty speedy I imagine.
I'm afraid I don't know what the histogram function computes, but this
may be worth investigating because I think I'll need it later on to
smooth my graphs (by plotting mean values over intervals).
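If I understand the suggestion correctly, it could be sketched like this (toy data, just to check the idea; the bin edges and values are made up):

```python
import numpy

x = numpy.array([0.5, 1.5, 1.6, 2.5])       # positions
y = numpy.array([1.0, 2.0, 4.0, 8.0])       # values to average
bins = numpy.array([0.0, 1.0, 2.0, 3.0])    # 3 equal-width intervals

sums, _ = numpy.histogram(x, bins=bins, weights=y)  # sum of y per interval
counts, _ = numpy.histogram(x, bins=bins)           # number of points per interval
means = sums / numpy.maximum(counts, 1)             # mean of y per interval (0 if empty)
# means == [1.0, 3.0, 8.0]
```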
Bruce Southey wrote:
> If you use Knuth's one pass approach
> (http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#III._On-line_algorithm)
> you can write a function to get the min, max, mean and variance/standard
> deviation in a single pass through the array rather than one pass for
> each. I do not know if this will provide any advantage as that will
> probably depend on the size of the arrays.
If I understand correctly, this algorithm computes the variance of a
whole array. I can see how to adapt it to compute the mean (already
done by the algorithm), max, min, etc., but I did not see how to adapt
it to my case.
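For reference, here is my understanding of the one-pass (Welford) algorithm from that page, sketched on a plain Python iterable and extended with min/max as Bruce suggested (not adapted to the grouping problem):

```python
def welford(values):
    # One pass over the data: running mean and sum of squared deviations
    # (Welford's update), plus min and max tracked along the way.
    n = 0
    mean = 0.0
    m2 = 0.0  # sum of squared deviations from the current mean
    lo = float('inf')
    hi = float('-inf')
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        m2 += delta * (v - mean)
        lo = min(lo, v)
        hi = max(hi, v)
    variance = m2 / (n - 1) if n > 1 else 0.0  # sample variance
    return lo, hi, mean, variance
```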
> Also, please use the highest precision possible (ie float128) for your
> arrays to minimize numerical error due to the size of your arrays.
Thanks for the advice!
So, thank you again everybody.
Cheers,
Franck