[Numpy-discussion] Can I add rows and columns to recarray?
Francesc Alted
faltet@pytables....
Mon Dec 6 17:06:54 CST 2010
On Monday 06 December 2010 22:00:29 Wai Yip Tung wrote:
> Thank you for the quick response and Christopher's explanation on the
> design background.
>
> All my tables fit in memory. I want to explore the data interactively,
> and a relational database does not provide me a lot of value.
>
> I was rolling my own library before I came to numpy. Then I found
> numpy's universal functions awesome and a really good fit for what I
> want to do. Now I just need to find out how to add a row, which is easy
> in Python. It is OK if it rebuilds the array when I add a column, which
> should happen infrequently. But if adding a row builds a new array, this
> will lead to O(n^2) complexity. In any case, I will explore the
> recfunctions.
If you want a container with a better complexity for adding columns
than O(n^2), you may want to have a look at the ctable object in the
carray package:
https://github.com/FrancescAlted/carray
carray is about providing compressed, in-memory data containers for both
homogeneous data (arrays) and heterogeneous data (structured arrays).
Here is an example of use:
>>> import numpy as np
>>> import carray as ca
>>> NR = 1000*1000
>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> new_field = np.arange(NR, dtype='f8')**3
>>> rc = ca.ctable(r)
>>> rc
ctable((1000000,), [('f0', '<i4'), ('f1', '<i8')])
nbytes: 11.44 MB; cbytes: 1.71 MB; ratio: 6.70
[(0, 0), (1, 1), (2, 4), ..., (999997, 999994000009), (999998,
999996000004), (999999, 999998000001)]
>>> time rc.addcol(new_field, "f2")
CPU times: user 0.03 s, sys: 0.00 s, total: 0.03 s
Wall time: 0.03 s
that is, only 30 ms for appending a column. This is basically the time
to copy (and compress) the data (i.e. O(n)). If you append an already
compressed column, the cost of adding it is O(1):
>>> r = np.fromiter(((i,i*i) for i in xrange(NR)), dtype="i4,i8")
>>> rc = ca.ctable(r)
>>> cnew_field = ca.carray(np.arange(NR, dtype='f8')**3)
>>> time rc.addcol(cnew_field, "f2")
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
On the other hand, using plain structured arrays is considerably more costly:
>>> import numpy.lib.recfunctions as nprf
>>> time r2 = nprf.rec_append_fields(r, 'f2', new_field, 'f8')
CPU times: user 0.34 s, sys: 0.02 s, total: 0.36 s
Wall time: 0.36 s
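To see why adding a column to a plain structured array cannot be cheaper than O(n), here is a minimal sketch of what such an operation has to do: allocate a new array with the wider dtype and copy every existing field into it. The helper name add_column is hypothetical (it is not part of numpy or carray), and this is only an illustration of the idea, not how rec_append_fields is actually implemented:

```python
import numpy as np

def add_column(base, name, values):
    """Hypothetical helper: return a NEW structured array with one extra field.

    Every existing field is copied, so the cost grows linearly with len(base).
    """
    values = np.asarray(values)
    # Build the wider dtype: all existing fields plus the new one.
    new_dtype = np.dtype(
        [(n, base.dtype[n]) for n in base.dtype.names] + [(name, values.dtype)]
    )
    out = np.empty(base.shape, dtype=new_dtype)
    for n in base.dtype.names:
        out[n] = base[n]          # full copy of each existing column
    out[name] = values            # copy of the new column
    return out

# Small stand-in for the table used in the session above (NR reduced).
NR = 1000
r = np.empty(NR, dtype="i4,i8")
r["f0"] = np.arange(NR)
r["f1"] = np.arange(NR, dtype="i8") ** 2
new_field = np.arange(NR, dtype="f8") ** 3

r2 = add_column(r, "f2", new_field)
print(r2.dtype.names)
print(r2[2])
```

The original array r is left untouched; r2 is a separate copy with three fields.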
Appending data at the end of ctable objects is also very fast:
>>> timeit rc.append(row)
100000 loops, best of 3: 13.1 µs per loop
Compare this with an append to a structured array:
>>> timeit np.concatenate((r2, row))
100 loops, best of 3: 6.84 ms per loop
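The np.concatenate call is slow because it copies the whole array on every append, which is where the O(n^2) total cost for n appends comes from. If you want to stay with plain NumPy, one common workaround is to over-allocate and double the buffer when it fills up, so that appends are amortized O(1). The sketch below is my own illustration of that idea; the class name GrowableTable is hypothetical and this is not how ctable works internally (ctable appends to compressed chunks):

```python
import numpy as np

class GrowableTable:
    """Hypothetical helper: structured array with spare capacity for appends."""

    def __init__(self, dtype, capacity=16):
        self._buf = np.empty(max(1, capacity), dtype=dtype)
        self._n = 0  # number of rows actually in use

    def append(self, row):
        if self._n == len(self._buf):
            # Buffer is full: double it. This copy happens only O(log n)
            # times over n appends, so the amortized cost per append is O(1).
            new_buf = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
            new_buf[:self._n] = self._buf[:self._n]
            self._buf = new_buf
        self._buf[self._n] = row  # a tuple matching the dtype's fields
        self._n += 1

    def view(self):
        # A view (no copy) over the rows appended so far.
        return self._buf[:self._n]

t = GrowableTable("i4,i8,f8", capacity=2)
for i in range(10):
    t.append((i, i * i, float(i) ** 3))

v = t.view()
print(len(v))
print(v["f1"])
```

The array returned by view() supports the usual ufuncs and field access, but it becomes stale once a later append triggers a reallocation, so take a fresh view after appending.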
Unfortunately you cannot do the full range of operations supported by
structured arrays with ctables, and a ctable object is rather meant to
be used as an efficient, compressed container for structures in memory:
>>> r2[2]
(2, 4, 8.0)
>>> rc[2]
(2, 4, 8.0)
>>> r2['f1']
array([0, 1, 4, ..., 1, 1, 1])
>>> rc['f1']
carray((1452223,), int64) nbytes: 11.08 MB; cbytes: 1.62 MB; ratio: 6.85
cparams := cparams(clevel=5, shuffle=True)
[0, 1, 4, ..., 1, 1, 1]
But still, you can do fun things like complex queries:
>>> [r for r in rc.getif("(f0<10)&(f2>4)", ["__nrow__", "f1"])]
[(2, 4),
(3, 9),
(4, 16),
(5, 25),
(6, 36),
(7, 49),
(8, 64),
(9, 81),
(1041112, 1)]
The queries are also very fast (both Numexpr and Blosc are used under
the hood):
>>> timeit [r for r in rc.getif("(f0<10)&(f2>4)")]
10 loops, best of 3: 58.6 ms per loop
>>> timeit r2[(r2['f0']<10)&(r2['f2']>4)]
10 loops, best of 3: 28 ms per loop
So, queries on ctables are only about 2x slower than on plain structured
arrays. Of course, the secret goal is to make this sort of query
actually faster than with structured arrays :)
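For reference, here is a self-contained sketch of the structured-array side of that comparison, written so that it returns the row number together with "f1", like the getif() call with "__nrow__" above. It uses only plain NumPy, builds a small table of its own (NR reduced, no extra appended rows), and makes no claim about how getif() is implemented:

```python
import numpy as np

# Small stand-in table with the same three fields as in the session above.
NR = 1000
idx = np.arange(NR)
r2 = np.empty(NR, dtype=[("f0", "i4"), ("f1", "i8"), ("f2", "f8")])
r2["f0"] = idx
r2["f1"] = idx.astype("i8") ** 2
r2["f2"] = idx.astype("f8") ** 3

# Boolean mask equivalent to the query string "(f0<10)&(f2>4)".
mask = (r2["f0"] < 10) & (r2["f2"] > 4)

# Row numbers of the matches, paired with the corresponding f1 values.
rows = np.flatnonzero(mask)
result = list(zip(rows.tolist(), r2["f1"][mask].tolist()))
print(result)
```

On this table the matches are rows 2 through 9, since f2 = f0**3 first exceeds 4 at row 2.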
I still need to finish the docs, but I plan to release carray 0.3 later
this week.
Cheers,
--
Francesc Alted