mysql -> record array
tim.hochberg at ieee.org
Tue Nov 14 22:14:10 CST 2006
Erin Sheldon wrote:
> On 11/14/06, Tim Hochberg <tim.hochberg at ieee.org> wrote:
>> Tim Hochberg wrote:
>>> I'm no database user, but a glance at the docs seems to indicate
>>> that you can get your data via an iterator (by iterating over the cursor
>>> or some such db mumbo jumbo) rather than slurping up the whole list
>>> at once. If so, then you'll save a lot of memory by passing the iterator
>>> straight to fromiter. It may even be faster, who knows.
>>> Accessing the db via the iterator could be a performance killer, but
>>> it's almost certainly worth trying, as it could save a few megabytes of
>>> storage, and that in turn might speed things up.
>> Assuming that I didn't mess this up too badly, it appears that using the
>> iterator directly with fromiter is significantly faster than the next
>> best solution (about 45%). The fromiter wrapping a list solution comes in
>> second, followed by numarray.array and finally, way in the back,
>> numpy.array. Here are the numbers:
>> retrieve1 took 0.902922857514 seconds
>> retrieve2 took 1.31245870634 seconds
>> retrieve3 took 1.51207569677 seconds
>> retrieve4 took 8.71539930354 seconds
> Interesting results, Tim. From Pierre's results
> we saw that fromiter is the fastest way to get data
> into arrays. With your results we see there is a
> difference between iterating over the cursor and
> doing a fetchall() as well. Surprisingly, running
> the cursor is faster.
> This must come not from the data retrieval rate but
> from creating the copies in memory.
I imagine that is correct. In particular, skipping the making of the
list avoids the creation of 1e6 Python floats, which is going to result
in a lot of memory allocation.
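For reference, the two access patterns being benchmarked can be sketched as follows. This is a hedged reconstruction, not the thread's actual code: the table name, column, and row count here are hypothetical stand-ins for whatever the "scratch" module's setup() creates.

```python
import sqlite3
import numpy as np

# Hypothetical one-column table of floats standing in for the benchmark's
# setup; the actual schema used in the thread isn't shown.
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE data (x REAL)")
conn.executemany("INSERT INTO data VALUES (?)",
                 ((float(i),) for i in range(10000)))

# retrieve1-style: feed the cursor straight to fromiter.  Each row is a
# 1-tuple, so unpack the single field as we go; no intermediate list of
# Python floats is ever built.
cur = conn.execute("SELECT x FROM data")
a = np.fromiter((row[0] for row in cur), dtype=float)

# retrieve2-style: fetchall() first materializes a list of row tuples
# (one Python float object per value), then the array is built from it.
rows = conn.execute("SELECT x FROM data").fetchall()
b = np.array([r[0] for r in rows], dtype=float)

assert (a == b).all()
```

The fromiter path still creates each Python float transiently, but they are freed as the iteration proceeds rather than all living in a list at once.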
> But just in case,
> I think there is one more thing to check.
> I haven't used sqlite, but with other databases I have
> used there is often a large variance in times from
> one select to the next. Can you
> repeat these tests with a timeit().repeat and give the
Sure. Here are two sets of numbers. The first is for repeat(3, 1) and the
second for repeat(3, 3).
retrieve1 [0.91198546183942375, 0.9042411814909439, 0.90411518782415001]
retrieve2 [0.98355349632425515, 0.95424502276127754,
retrieve3 [1.2227562441595268, 1.2195848913758596, 1.2206193803961156]
retrieve4 [8.4344040932576547, 8.3556245276983532, 8.3568341786456131]
retrieve1 [2.7317457945074026, 2.7274656415829384, 2.7250913174719109]
retrieve2 [2.8857103346933783, 2.8379299603720582, 2.8386803350705136]
retrieve3 [3.6870535221655203, 3.8980253076857565, 3.7002303365371887]
retrieve4 [25.138646950939304, 25.06737169109482, 25.052789390830412]
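For readers unfamiliar with the timeit API: repeat(n, m) returns a list of n timings, each covering m executions of the statement, which is why the repeat(3, 3) numbers above are roughly three times the repeat(3, 1) numbers. A minimal illustration (the statement timed here is arbitrary):

```python
import timeit

t = timeit.Timer("sum(range(1000))")
one = t.repeat(3, 1)    # list of 3 timings, 1 execution each
three = t.repeat(3, 3)  # list of 3 timings, 3 executions each
print(one)
print(three)
```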
The timings of these are pretty consistent with the
previous runs, except that the difference between retrieve1 and retrieve2
has disappeared. In fact, all of the runs that produce lists have gotten
faster by about the same amount. Odd! A little digging reveals that
timeit turns off garbage collection to make things more repeatable.
Turning gc back on yields the following numbers for repeat(3, 1):
retrieve1 [0.92517736192728406, 0.92109667569481601,
retrieve2 [1.3018456256311914, 1.2277141368525903, 1.2929785768861706]
retrieve3 [1.5309831277438946, 1.4998853206203577, 1.5601200711263488]
retrieve4 [8.6400394463542227, 8.7022300320292061, 8.6807761880350682]
So there we are, back to our original numbers. This also reveals that
the majority of the time difference between retrieve1 and retrieve2 *is*
memory related. However, it's the deallocation (or more precisely
garbage collection) of all those floats that is the killer. Here's what
the timeit routines looked like:
if __name__ == "__main__":
    for name in ['retrieve1', 'retrieve2', 'retrieve3', 'retrieve4']:
        print name, timeit.Timer(
            "%s(conn)" % name,
            "gc.enable(); "
            "from scratch import sqlite3, %s, setup; "
            "conn = sqlite3.connect(':memory:'); setup(conn)" % name).repeat(3, 1)
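The claim that timeit disables garbage collection around the timed statement can be checked directly. This sketch uses the globals argument of timeit.Timer (Python 3.5+), which postdates the thread but makes the check compact:

```python
import gc
import timeit

flags = []
# Record whether gc is enabled while timeit runs the statement.
timeit.Timer("flags.append(gc.isenabled())",
             globals={"flags": flags, "gc": gc}).timeit(number=1)
print(flags)  # gc is off inside the timed statement
```

Hence the gc.enable() in the setup string above: without it, deallocation costs that a real program would pay are hidden from the timings.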
> As an aside, your database is running on a local disk, right, so
> the overhead of retrieving data is minimized here?
> For my tests I think I am data retrieval limited because I
> get exactly the same time for the equivalent of retrieve1
> and retrieve2.
As Keith pointed out, I'm keeping the database in memory (although
there's a very good chance some of it is actually swapped to disk) so
it's probably relatively fast. On the other hand, if you are using
timeit to make your measurements you could be running into the (lack of)
garbage collection issue I mention above.