[Numpy-discussion] mysql -> record array

Francesc Altet faltet at carabos.com
Fri Nov 17 12:56:04 CST 2006


Hi,

First of all, sorry for having mixed up t1, t2 and t3 in my previous
message (not proofreading before sending a message has
consequences :(). I'll try to clarify things here while introducing
new timings and other discoveries.

Also, as the message below is quite long, I will state up front my
main conclusions about the time taken to read a database using SQLite3:

1. Most of the time is spent building a list of Python objects as a
   container for the data (60%)

2. A small slice of time goes to creating the final recarray (15%)

3. Pure reading of the database accounts for around 25% of the total time

If you want to know how I've arrived at these conclusions, please
read on.

On Friday 17 November 2006 14:21, Francesc Altet wrote:
> On Thursday 16 November 2006 22:28, Erin Sheldon wrote:
> > Hi Francesc -
> >
> > Unless I missed something, I think what you have
> > shown is that the combination of
> >       (getting data from database into python lists) +
> >       (converting to arrays)
> > is what is taking time.   I would guess the first takes
> > significantly longer than the second.
>
> Honestly, I don't think I have demonstrated anything really solid in
> that regard with so little evidence. But we can try looking for more
> of it :)
>
> For example, I'd split the times into:
>
>     t1 (getting data from database) +
>     t2 (python lists) +
>     t3 (converting to arrays) =
>     tt (total time)
[SNIP]
> So, in our case, tt for SQLite3 was 3.26 seconds. With that, we can
> derive its t1 (getting data from database):
>
> t1 = tt - t2 - t3 =~ 2.26 seconds
>
> However, this is still far more than tt for PyTables (~ 0.14 sec), so
> I'm not completely sure what's going on. Honestly, I don't think that
> HDF5 (the underlying library for doing I/O in PyTables) would be
> almost 20x faster than SQLite3 for reading purposes. So my guess is
> that there are more factors contributing to tt for SQLite3 that
> I've not taken into account. Can anyone find them?

I've found another factor that contributes more time to t2 (Python
list creation). Earlier, I timed the building of a list of tuples
made of NumPy scalars instead of Python floats (the latter being what
pysqlite actually provides). So, if we force the creation of floats,
we spend quite a bit more time:

In [80]: Timer("[(float(x),float(x)) for x in np.arange(500000, dtype='float64')]",
   ....:       "import numpy as np").repeat(3,1)
Out[80]: [1.440762996673584, 1.0981760025024414, 1.1034038066864014]

i.e. 1.1 sec (best case) for t2 instead of 0.5 sec. That gets us a
bit closer. Interestingly enough, the first iteration of the loop
above takes almost 1.5 seconds, while the other 2 iterations probably
benefit from some caching effect. As the SQLite3 benchmark measures
the time for only one iteration (and besides, we are searching for
lost time ;), we will take 1.44 sec for now.

But... wait! Now Tim's discovery that the timeit module deactivates
the garbage collector comes to my mind. Does that have an effect
here? Well, let's see:

In [85]: Timer("[(float(x),float(x)) for x in np.arange(500000, dtype='float64')]",
   ....:       "gc.enable(); import numpy as np").repeat(3,1)
Out[85]: [1.9229378700256348, 1.7526938915252686, 1.8899729251861572]

Wow, it does indeed! So let's pick the worst value for this: t2
becomes 1.92 sec. That looks pretty good.
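For the record, the GC-on/GC-off comparison can be packaged as the
following self-contained sketch (timeit disables the garbage
collector around the timed statement by default, and a "gc.enable()"
in the setup string switches it back on):

from timeit import Timer

stmt = "[(float(x), float(x)) for x in np.arange(500000, dtype='float64')]"
# Default behaviour: timeit runs the statement with the GC disabled
t_nogc = Timer(stmt, "import numpy as np").repeat(3, 1)
# With gc.enable() in the setup, the GC is active while timing
t_gc = Timer(stmt, "gc.enable(); import numpy as np").repeat(3, 1)
print("GC off:", t_nogc)
print("GC on: ", t_gc)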

Now, what about t3 (recarray creation)? If we repeat the timings:

In [90]: Timer("np.fromiter(lot, dtype=dtype)",
   ....:       "import numpy as np; "
   ....:       "lot=[(float(x),float(x)) for x in np.arange(500000, dtype='float64')]; "
   ....:       "dtype=np.dtype([('x', 'float64'), ('y', 'float64')])").repeat(3,1)
Out[90]: [0.52559113502502441, 0.52883505821228027, 0.52126002311706543]

we see that the worst time is 0.53 sec. And repeating the timings
with the garbage collector activated:

In [91]: Timer("np.fromiter(lot, dtype=dtype)",
   ....:       "gc.enable(); import numpy as np; "
   ....:       "lot=[(float(x),float(x)) for x in np.arange(500000, dtype='float64')]; "
   ....:       "dtype=np.dtype([('x', 'float64'), ('y', 'float64')])").repeat(3,1)
Out[91]: [0.52265405654907227, 0.52456402778625488, 0.52321815490722656]

Well, the garbage collector doesn't have much of an effect here, so
we will keep t3 = 0.53 sec.
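Putting t2 and t3 together, the whole list-to-recarray conversion
amounts to this minimal sketch:

import numpy as np

# t2: build the list of Python-float tuples (the shape of what
# pysqlite's fetchall() gives back)
rows = [(float(x), float(x)) for x in np.arange(500000, dtype='float64')]

# t3: convert it into a structured array in one pass, then view it
# as a recarray so the fields are reachable as attributes
dtype = np.dtype([('x', 'float64'), ('y', 'float64')])
ra = np.fromiter(rows, dtype=dtype).view(np.recarray)
print(ra.x[:3], ra.y[:3])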

Now the moment of truth has come. How does this affect t1 (i.e. the
time for getting data from the database)?

t1 = tt - t2 - t3 = 3.26 - 1.92 - 0.53 =~ 0.81 sec

Well, 0.81 sec is much less than the 2.26 sec computed in my previous
message. But is 0.81 sec a trustworthy figure for reading the data in
SQLite3? To begin with, I think it is very much comparable with a
benchmark Tim introduced in a previous message. That benchmark
consisted of a query with a single condition that returns nothing
(so no time is wasted on list or recarray creation). Here is the
time for this benchmark on my laptop:

query SQLite took 0.5474588871 seconds

so it seems that SQLite would take no more than 0.55 seconds to read
the complete table. But beware, this could be a bit tricky! Perhaps
SQLite is saving data column-wise (instead of row-wise like HDF5
does), so there is a chance that what Tim's query actually measures
is the time to read *one* column, not both (this is a guess on my
part, as I don't really know whether SQLite saves its table data
row-wise or column-wise).

To check this, I've modified the query to take both columns into
account instead of only one ('x = 0 or y = 0'). Here is the result:

query SQLite took 0.819041013718 seconds
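For reference, a sketch of what these two query benchmarks might look
like (the database, table and column names here are my guesses, not
necessarily Tim's actual script):

import sqlite3, time

con = sqlite3.connect('test.db')   # assumed file with a 'test' table

t0 = time.time()
con.execute("select * from test where x = 0").fetchall()
print("one condition:  %.3f s" % (time.time() - t0))

t0 = time.time()
con.execute("select * from test where x = 0 or y = 0").fetchall()
print("two conditions: %.3f s" % (time.time() - t0))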

This time looks suspiciously similar to our 0.81 sec computed
previously. However, we don't know yet how much of this time should
be attributed to reading the data and how much to evaluating the
selection conditions. Perhaps measuring how long this takes with
numexpr (in the end, both do the query in C space) can give us an
estimate. Here is the measurement:

For one condition:
In [32]: Timer("numexpr.evaluate('x == 0')",
   ....:       "import numpy; from tables import numexpr; "
   ....:       "x=numpy.random.rand(500000)").repeat(3,100)
Out[32]: [0.58732295036315918, 0.5366971492767334, 0.53870081901550293]

For two conditions:
In [33]: Timer("numexpr.evaluate('x == 0 or y == 0')",
   ....:       "import numpy; from tables import numexpr; "
   ....:       "x=numpy.random.rand(500000); "
   ....:       "y=numpy.random.rand(500000)").repeat(3,100)
Out[33]: [0.566741943359375, 0.53875589370727539, 0.53498196601867676]

So, given that the loop was repeated 100 times, the time required for
checking the conditions in C is most probably only around 1/100th of
the time required for the reading (0.53 sec / 100 =~ 5 ms per
evaluation, against 0.81 sec for the read). We have evidence, then,
that this contribution to the total time can be neglected.

With this, we can conclude (if we are brave enough ;) that the time
for a pure read of the SQLite3 database in our scenario is near 0.81
sec. We have arrived at this figure along two different paths, namely:

1. Timing the read of the 2-column table from Python and subtracting
   the Python overhead, and...
2. Running a query that implies reading *both* columns

So, let's state the final results here:

    t1 (getting data from database) = 0.81 sec  (~25% of tt)
    t2 (creation of python lists)   = 1.92 sec  (~60% of tt)
    t3 (converting to recarrays)    = 0.53 sec  (~15% of tt)
    tt (total time, computed)       = 3.26 sec
    tt (total time, measured)       = 3.26 sec
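For the skeptical, a quick arithmetic check of the breakdown above:

t1, t2, t3 = 0.81, 1.92, 0.53
tt = t1 + t2 + t3       # 3.26 s, matching the measured total
for name, t in [('t1', t1), ('t2', t2), ('t3', t3)]:
    print("%s: %.2f s (~%.0f%% of tt)" % (name, t, 100 * t / tt))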

[Incidentally, it is quite a coincidence that the computed and
measured tt agree so precisely, but we should accept that this can
happen sometimes, just because the statistics sometimes allow it ;)]

I think we now have some grounds for concluding that, when reading
big tables from a database with the complete database in cache (be it
the filesystem cache or the database cache), 1) most of the time is
spent building a list of Python objects as a container for the data
(60%), 2) a small slice of time goes to creating the recarray (15%),
and 3) pure reading is around 25% of the total time.

As an aside, it is worth comparing the time SQLite3 needs for pure
reading with the time PyTables needs. Let me recall the figures:

retrieve PyTables took 0.119920015335 seconds
query PyTables took 0.130054950714 seconds (one condition)
query PyTables took 0.157742023468 seconds (two conditions)

Given that PyTables uses a native NumPy container for keeping the
data, we can attribute all of this time (0.12 sec) to pure database
reading. So 4) PyTables reads (i.e. done from Python space) can be
almost 7x faster (0.81 sec / 0.12 sec =~ 6.75) than pure reads of
SQLite3 (i.e. done at the C level). I still find this a bit shocking,
and I'd say there are three possible explanations:

1. SQLite3 performance is somewhat poor for reading whole tables
   in memory. Perhaps Postgres or MySQL would perform better.

2. HDF5 shines especially for reading whole tables in memory.

3. My estimates of SQLite3 pure reads are completely wrong and the
   actual times are much lower!

Mmm, I'd say that repeating the tests with MySQL or Postgres could
shed more light on this. Volunteers?

Oops, I think I've far exceeded what I meant to write at the
beginning of this message. Sorry about that!

Cheers,

-- 
>0,0<   Francesc Altet     http://www.carabos.com/
V   V   Cárabos Coop. V.   Enjoy Data
 "-"


