mysql -> record array

Tim Hochberg tim.hochberg at ieee.org
Tue Nov 14 20:09:09 CST 2006


Tim Hochberg wrote:
> John Hunter wrote:
>   
>>>>>>> "Erin" == Erin Sheldon <erin.sheldon at gmail.com> writes:
>>>>>>>             
>>>>>>>               
>>     Erin> The question I have been asking myself is "what is the
>>     Erin> advantage of such an approach?".  It would be faster, but by
>>
>> In the use case that prompted this message, the pull from mysql took
>> almost 3 seconds, and the conversion from lists to numpy arrays took
>> more that 4 seconds.  We have a list of about 500000 2 tuples of
>> floats.
>>   
>>     
> I'm no database user, but a glance at the at the docs seems to indicate 
> that you can get your data via an iterator (by iterating over the cursor 
> or some such db mumbo jumbo) rather than slurping up the whole list up 
> at once. If so, then you'll save a lot of memory by passing the iterator 
> straight to fromiter. It may even be faster, who knows.
>
> Accessing the db via the iterator could be a performance killer, but 
> it's almost certainly worth trying as it could a few megabytes of 
> storage and that in turn might speed things up.
>
>   
Assuming that I didn't mess this up too badly, it appears that using the 
iterator directly with fromiter is significantly faster than the next 
best solution (about 45%). The fromiter wrapping a list solution come in 
second, followed by numarray.array and finally way in the back, 
numpy.array. Here's the numbers:

    retrieve1 took 0.902922857514 seconds
    retrieve2 took 1.31245870634 seconds
    retrieve3 took 1.51207569677 seconds
    retrieve4 took 8.71539930354 seconds

And here is the code:

    import sqlite3, numpy as np, numarray as na, time

    N = 500000

    def setup(conn):
        c = conn.cursor()
        c.execute('''create table demo (x real, y real)''')
        data = np.random.rand(N, 2)
        c.executemany("""insert into demo values (?, ?)""", data)

    def retrieve1(conn):
        c = conn.cursor()
        c.execute('select * from demo')
        y = np.fromiter(c, dtype=[('a',float), ('b', float)])
        return y

    def retrieve2(conn):
        c = conn.cursor()
        c.execute('select * from demo')
        y = np.fromiter(c.fetchall(), dtype=[('a',float), ('b', float)])
        return y

    def retrieve3(conn):
        c = conn.cursor()
        c.execute('select * from demo')
        y = na.array(c.fetchall())
        return y

    def retrieve4(conn):
        c = conn.cursor()
        c.execute('select * from demo')
        y = np.array(c.fetchall())
        return y

    conn = sqlite3.connect(':memory:')

    setup(conn)

    t0 = time.clock()
    y1 = retrieve1(conn)
    t1 = time.clock()
    y2 = retrieve2(conn)
    t2 = time.clock()
    y3 = retrieve3(conn)
    t3 = time.clock()
    y4 = retrieve4(conn)
    t4 = time.clock()

    assert y1.shape == y2.shape == y3.shape[:1] == y4.shape[:1] == (N,)
    assert np.alltrue(y1 == y2)

    print "retrieve1 took", t1-t0, "seconds"
    print "retrieve2 took", t2-t1, "seconds"
    print "retrieve3 took", t3-t2, "seconds"
    print "retrieve4 took", t4-t3, "seconds"







-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys - and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV




More information about the Numpy-discussion mailing list