[Numpy-discussion] numpy videos
Abhishek Pratap
apratap@lbl....
Tue Mar 13 17:58:11 CDT 2012
Thanks, guys. Very handy examples by Francesc. I need to bookmark them
until I reach this point.
best,
-Abhi
On Tue, Mar 13, 2012 at 9:24 AM, Francesc Alted <francesc@continuum.io> wrote:
> On Mar 13, 2012, at 7:31 AM, Sturla Molden wrote:
>
>> On 12.03.2012 23:23, Abhishek Pratap wrote:
>>> Super awesome. I love how the python community in general keeps the
>>> recordings available for free.
>>>
>>> @Adam: I do have some problems that I can hit numpy with, mainly
>>> big-data based. In summary, I have millions to billions of rows of
>>> biological data on which I want to run some computation, while also
>>> being able to do quick lookups. I am not sure whether numpy is
>>> suitable for quick lookups by a string-based key, right?
>>
>>
>> Jason Kinser's book on Python for bioinformatics might be of interest. Though I don't always agree with his NumPy coding style.
>>
>> As for "big data", it is a problem regardless of language. The HDF5 library might be of help (cf. PyTables or h5py; I actually prefer the latter).
>
> Yes; however, IMO PyTables adapts better to the OP's lookup use case. For example, let's suppose a very simple key-value problem, where we need to locate a certain value by using a key. Using h5py I get:
>
> In [1]: import numpy as np
>
> In [2]: N = 100*1000
>
> In [3]: sa = np.fromiter((('key'+str(i), i) for i in xrange(N)), dtype="S8,i4")
>
> In [4]: import h5py
>
> In [5]: f = h5py.File('h5py.h5', 'w')
>
> In [6]: d = f.create_dataset('sa', data=sa)
>
> In [7]: time [val for val in d if val[0] == 'key500']
> CPU times: user 28.34 s, sys: 0.06 s, total: 28.40 s
> Wall time: 29.25 s
> Out[7]: [('key500', 500)]
>
> Another option is to use fancy selection:
>
> In [8]: time d[d['f0']=='key500']
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.01 s
> Out[8]:
> array([('key500', 500)],
>       dtype=[('f0', 'S8'), ('f1', '<i4')])
>
> Hmm, time resolution is too poor here. Let's use the %timeit magic:
>
> In [9]: timeit d[d['f0']=='key500']
> 100 loops, best of 3: 9.3 ms per loop
>
> which is much better. But in this case you need to load the column d['f0'] completely in memory, and this is *not* what you want when you have large tables that do not fit in memory.
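
One way to keep memory bounded with a plain scan is to read the dataset in fixed-size slices and apply the boolean mask to each slice in turn. The following is only a minimal sketch of that idea, not something from the original session: the function name find_by_key and the chunk_rows parameter are hypothetical, and it assumes the dataset supports len() and slicing that returns a NumPy structured array (a plain ndarray does, and it is used here as a stand-in for an on-disk dataset). The array is built with bytes keys so that it runs on Python 3.

```python
import numpy as np


def find_by_key(dataset, key, field="f0", chunk_rows=10_000):
    """Return all rows whose `field` equals `key`, scanning slice by slice.

    `dataset` is assumed to support len() and slicing that yields a NumPy
    structured array. Only one slice of `chunk_rows` rows is held at a time.
    """
    # Seed with an empty slice so the result has the right dtype even when
    # nothing matches.
    matches = [dataset[0:0]]
    for start in range(0, len(dataset), chunk_rows):
        chunk = dataset[start:start + chunk_rows]
        matches.append(chunk[chunk[field] == key])
    return np.concatenate(matches)


# Same layout as the `sa` array in the session, with bytes keys (field type S8).
N = 100 * 1000
sa = np.array([(b"key%d" % i, i) for i in range(N)],
              dtype=[("f0", "S8"), ("f1", "<i4")])

hits = find_by_key(sa, b"key500")
print(hits)
```

This still touches every row, so it trades speed for a bounded memory footprint; it does not replace a real index.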
>
> Using PyTables:
>
> In [10]: import tables
>
> In [11]: ft = tables.openFile('pytables.h5', 'w')
>
> In [12]: dt = ft.createTable(ft.root, 'sa', sa)
>
> In [13]: time [val[:] for val in dt if val[0] == 'key500']
> CPU times: user 0.04 s, sys: 0.01 s, total: 0.05 s
> Wall time: 0.04 s
> Out[13]: [('key500', 500)]
>
> That's a speed-up of several hundred times over the equivalent h5py loop (29.25 s versus 0.04 s wall time). In addition, PyTables has specific machinery to optimize these queries by using numexpr behind the scenes:
>
> In [14]: time [val[:] for val in dt.where("f0=='key500'")]
> CPU times: user 0.01 s, sys: 0.00 s, total: 0.01 s
> Wall time: 0.00 s
> Out[14]: [('key500', 500)]
>
> Again, time resolution is too poor here. Let's use the %timeit magic:
>
> In [15]: timeit [val[:] for val in dt.where("f0=='key500'")]
> 100 loops, best of 3: 2.36 ms per loop
>
> This is an additional 10x speed-up. In fact, this is almost as fast as performing the query using NumPy directly:
>
> In [16]: timeit sa[sa['f0']=='key500']
> 100 loops, best of 3: 2.14 ms per loop
>
> with the difference that PyTables uses an out-of-core paradigm (i.e. it does not need to load the datasets completely in memory). Finally, PyTables supports true indexing, so you do not have to read the complete dataset to get results:
>
> In [17]: dt.cols.f0.createIndex()
> Out[17]: 100000
>
> In [18]: timeit [val[:] for val in dt.where("f0=='key500'")]
> 1000 loops, best of 3: 213 us per loop
>
> which accounts for another 10x speed-up. Of course, this speed-up can be *much* larger for bigger datasets, and especially for those that do not fit in memory. See:
>
> http://pytables.github.com/usersguide/optimization.html#accelerating-your-searches
>
> for a more detailed rationale and benchmarks on big datasets.
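
To see why an index avoids a full scan, the general idea can be sketched in NumPy alone: sort the keys once, then answer each lookup with a binary search. This is only an illustration of the principle and is not how PyTables implements its indexes; the names order, sorted_keys and lookup are hypothetical, and everything here lives in memory.

```python
import numpy as np

# Same layout as the `sa` array in the session, with bytes keys (field type S8).
N = 100 * 1000
sa = np.array([(b"key%d" % i, i) for i in range(N)],
              dtype=[("f0", "S8"), ("f1", "<i4")])

# Build the "index" once: row positions ordered by key, plus the sorted keys.
order = np.argsort(sa["f0"])
sorted_keys = sa["f0"][order]


def lookup(key):
    """Return the rows whose f0 equals `key`, using binary search."""
    lo = np.searchsorted(sorted_keys, key, side="left")
    hi = np.searchsorted(sorted_keys, key, side="right")
    # order[lo:hi] holds the original row positions of all matching keys.
    return sa[order[lo:hi]]


print(lookup(b"key500"))
```

Building the sorted view costs one sort up front, after which each lookup inspects on the order of log2(N) keys instead of all N.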
>
> -- Francesc Alted