mysql -> record array

Erin Sheldon erin.sheldon at gmail.com
Tue Nov 14 20:26:48 CST 2006


On 11/14/06, Tim Hochberg <tim.hochberg at ieee.org> wrote:
> Tim Hochberg wrote:
> > John Hunter wrote:
> >
> >>>>>>> "Erin" == Erin Sheldon <erin.sheldon at gmail.com> writes:
> >>>>>>>
> >>>>>>>
> >>     Erin> The question I have been asking myself is "what is the
> >>     Erin> advantage of such an approach?".  It would be faster, but by
> >>
> >> In the use case that prompted this message, the pull from mysql took
> >> almost 3 seconds, and the conversion from lists to numpy arrays took
> >> more than 4 seconds.  We have a list of about 500,000 2-tuples of
> >> floats.
> >>
> >>
> > I'm no database user, but a glance at the docs seems to indicate
> > that you can get your data via an iterator (by iterating over the cursor
> > or some such db mumbo jumbo) rather than slurping up the whole list
> > at once. If so, then you'll save a lot of memory by passing the iterator
> > straight to fromiter. It may even be faster, who knows.
> >
> > Accessing the db via the iterator could be a performance killer, but
> > it's almost certainly worth trying, as it could save a few megabytes of
> > storage and that in turn might speed things up.
> >
> >
> Assuming that I didn't mess this up too badly, it appears that using the
> iterator directly with fromiter is significantly faster (by about 45%)
> than the next best solution. The fromiter-wrapping-a-list solution came
> in second, followed by numarray.array and finally, way in the back,
> numpy.array. Here are the numbers:
>
>     retrieve1 took 0.902922857514 seconds
>     retrieve2 took 1.31245870634 seconds
>     retrieve3 took 1.51207569677 seconds
>     retrieve4 took 8.71539930354 seconds
>
>

Interesting results, Tim.  From Pierre's results
we saw that fromiter is the fastest way to get data
into arrays.  Your results show that there is also a
difference between iterating over the cursor and
doing a fetchall().  Surprisingly, iterating over
the cursor is faster.
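For concreteness, a minimal sketch of the two retrieval styles being compared. This uses an in-memory sqlite table as a hypothetical stand-in for the MySQL data in the thread (the table name, column names, and row count are all made up); the `chain.from_iterable` trick flattens the row tuples so fromiter can consume the cursor directly without ever building a Python list:

```python
import sqlite3
from itertools import chain

import numpy as np

# Hypothetical in-memory table standing in for the real MySQL data:
# rows of (float, float) pairs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany("INSERT INTO points VALUES (?, ?)",
                 ((float(i), 0.5 * float(i)) for i in range(1000)))

# Cursor route: flatten the row tuples lazily and feed them straight
# to fromiter -- no intermediate Python list is ever materialized.
cur = conn.execute("SELECT x, y FROM points")
a = np.fromiter(chain.from_iterable(cur), dtype=np.float64).reshape(-1, 2)

# fetchall() route: slurp the full list of tuples, then convert it.
rows = conn.execute("SELECT x, y FROM points").fetchall()
b = np.array(rows, dtype=np.float64)
```

Both routes produce the same (N, 2) float array; the difference is whether the full list of tuples ever exists in memory.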

This must come not from the data retrieval rate but
from the intermediate copies created in memory. Just
in case, though, I think there is one more thing to check.
I haven't used sqlite, but with other databases I have
used there is often a large variance in times from
one select to the next.  Can you
repeat these tests with timeit's repeat() and give the
minimum?
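The suggested measurement would look something like the sketch below. The `retrieve` function here is a made-up stand-in for the real retrieveN functions being benchmarked; the point is the repeat-and-take-the-minimum pattern:

```python
import timeit

# Hypothetical stand-in for one of the retrieveN functions;
# the real benchmark would time the actual database pull.
def retrieve():
    return sum(i * i for i in range(10000))

# repeat() runs the timing loop several independent times; taking the
# minimum discards runs inflated by caching effects or other system
# noise, giving a more stable lower-bound estimate.
times = timeit.Timer(retrieve).repeat(repeat=5, number=10)
best = min(times)
```

Reporting the minimum rather than the mean is the usual way to damp the select-to-select variance mentioned above.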

As an aside, your database is running on a local disk, right,
so the overhead of retrieving the data is minimized here?
For my tests I think I am limited by data retrieval, because I
get exactly the same time for the equivalents of retrieve1
and retrieve2.

Erin
