mysql -> record array

Erin Sheldon erin.sheldon at gmail.com
Thu Nov 16 15:28:18 CST 2006


Hi Francesc -

Unless I missed something, I think what you have
shown is that the combination of
      (getting data from database into python lists) +
      (converting to arrays)
is what is taking time.   I would guess the first takes
significantly longer than the second.
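
A quick way to check that guess (an untested sketch; it assumes the
same 'demo' table as Tim's benchmark, a two-float record layout, and a
DB-API connection) is to time the two stages separately:

    import time
    import numpy as np

    dtype = np.dtype([('x', np.float64), ('y', np.float64)])  # assumed layout

    def fetch_then_convert(conn):
        # Stage 1: pull the rows out of the database as Python tuples
        t0 = time.time()
        c = conn.cursor()
        c.execute('select * from demo')
        rows = c.fetchall()
        t1 = time.time()
        # Stage 2: convert the Python tuples into a record array
        arr = np.fromiter(rows, dtype=dtype)
        t2 = time.time()
        print 'fetch: %.3f s  convert: %.3f s' % (t1 - t0, t2 - t1)
        return arr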

Erin

On 11/16/06, Francesc Altet <faltet at carabos.com> wrote:
> Hi,
>
> Well, let's see whether this shiny new list works :)
>
> On Thu, 16 Nov 2006 at 11:05 -0700, Tim Hochberg wrote:
> > One weakness of this benchmark is that it doesn't break out how much of
> > the sqlite3 overhead is inherent to the sqlite3 engine, which I expect
> > is somewhat more complicated internally than PyTables, and how much is
> > due to all the extra layers we go through to get the data into an array
> > (native[in database]->Python Objects->Native[In record array]). To try
> > to get at least a little handle on this, I added this test:
> >
> >     def querySQLite(conn):
> >         c = conn.cursor()
> >         c.execute('select * from demo where x = 0.0')
> >         y = np.fromiter(c, dtype=dtype)
> >         return y
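> >
> > (Here 'np' is numpy, and 'dtype' is the record dtype defined earlier
> > in the benchmark script -- presumably something along the lines of
> > the following, given the x and y columns used elsewhere in the
> > thread:)
> >
> >     dtype = np.dtype([('x', np.float64), ('y', np.float64)])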
> >
> > This returns very little data (in the cases I ran it actually returned
> > no data). However, it still needs to loop over all the records and
> > examine them. Here's what the timings looked like:
> >
> >     setup SQLite took 9.71799993515 seconds
> >     retrieve SQLite took 0.921999931335 seconds
> >     query SQLite took 0.313000202179 seconds
> >
> > I'm reluctant to conclude that 1/3 of the time is spent in
> > traversing the database and 2/3 of the time in creating the data solely
> > because databases are big voodoo to me. Still, we can probably conclude
> > that traversing the data itself is pretty expensive and we would be
> > unlikely to approach PyTables speed even if we didn't have the extra
> > overhead. On the other hand, there's a factor of three or so improvement
> > that could be realized by reducing overhead.
> >
> > Or maybe not. I think that the database has to return its data a row at
> > a time, so there's intrinsically a lot of copying that's going to
> > happen. So, I think it's unclear whether getting the data directly in
> > native format would be significantly cheaper. I suppose that the way to
> > definitively test it would be to rewrite one of these tests in C. Any
> > volunteers?
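> >
> > (Short of writing C, one untested way to isolate the traversal cost
> > is a hypothetical test that walks the cursor without building
> > anything, so only the row loop and per-row tuple creation are
> > measured:)
> >
> >     def traverseSQLite(conn):
> >         # Iterate over every row but discard it; whatever the full
> >         # retrieval test spends beyond this is array-building overhead
> >         c = conn.cursor()
> >         c.execute('select * from demo')
> >         for row in c:
> >             pass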
>
> I don't think that doing this in C is going to be necessary at all.
> Sqlite3 has a console that allows doing SQL operations without Python in
> the middle.
>
> I've included your new function and here are my results for Python (just
> for comparison purposes):
>
> setup SQLite took 35.9286799431 seconds
> retrieve SQLite took 4.66451621056 seconds
> query SQLite took 0.907495975494 seconds
>
> Now, the query from the console:
>
> $ time sqlite3 /tmp/test.sql3 "select * from demo where x == 0.0"
>
> real    0m1.226s
> user    0m0.792s
> sys     0m0.120s
>
> If you sum the CPU user time and sys time, you get 0.9 seconds
> (roughly), which is approximately the same as what Python reports. So
> it doesn't seem that, for this specific case, the Python interface is
> taking a significant amount of time (as expected).
>
> For a complete retrieval we can do the same trick, but redirecting the
> output to /dev/null to avoid the cost of writing it to a real device:
>
> $ time sqlite3 /tmp/test.sql3 "select * from demo" > /dev/null
>
> real    0m3.690s
> user    0m3.376s
> sys     0m0.188s
>
> In this case the time sums to around 3.5 seconds, while the Python time
> is 4.6 seconds. However, the console has the handicap that it must
> format every row as a string before "writing" it to /dev/null, so I'd
> say that the time for a retrieval from a C program should be
> significantly less than 3.5 seconds (provided there is an appropriate
> in-memory container for keeping the data). The bottleneck, then, does
> effectively seem to be the conversion from the list of lists to a
> recarray.
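>
> (If that's right, one obvious mitigation -- an untested sketch, really
> just Tim's querySQLite without the where clause -- is to feed the
> cursor straight to fromiter, so the list of lists is never built:)
>
>     def retrieveSQLite2(conn):
>         # The cursor is already an iterator of row tuples, so fromiter
>         # can consume it without an intermediate list of lists
>         c = conn.cursor()
>         c.execute('select * from demo')
>         return np.fromiter(c, dtype=dtype)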
>
> Just out of curiosity, here is the equivalent where-query function in
> PyTables:
>
> def queryPT(fileh):
>     t = fileh.root.table
>     y = np.fromiter(((r['x'], r['y']) for r in t.where(t.cols.x == 0)),
>                     dtype=dtype)
>     return y
>
> and its times:
>
> setup PyTables took 0.453132867813 seconds
> retrieve PyTables took 0.356657981873 seconds
> query PyTables took 0.253107070923 seconds
>
> PyTables is more than 3x faster than SQLite for this specific query. I
> guess this is because the selects are done completely through numarray
> (in PyTables 2.x selections will be done through numexpr), and it seems
> difficult to beat such 'vectorized' selects. So, using a package like
> numarray or NumPy in the core of a database has a lot of advantages,
> especially when speed matters.
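>
> (For concreteness, a 'vectorized' select means the comparison runs as
> a single C-level pass over a whole column, roughly like the following
> NumPy equivalent over a made-up in-memory table, rather than
> evaluating a Python expression once per row:)
>
>     import numpy as np
>
>     arr = np.zeros(1000000, dtype=[('x', np.float64), ('y', np.float64)])
>     mask = arr['x'] == 0.0   # one C-level pass builds a boolean mask
>     hits = arr[mask]         # fancy indexing gathers the matching rows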
>
> Cheers,
>
> --
> Francesc Altet    |  Be careful about using the following code --
> Carabos Coop. V.  |  I've only proven that it works,
> www.carabos.com   |  I haven't tested it. -- Donald Knuth
>