[Numpy-discussion] Designing a new storage format for numpy recarrays

Stephen Simmons mail@stevesimmons....
Fri Oct 30 08:18:05 CDT 2009


Hi,

Is anyone working on alternative storage options for numpy arrays, and 
specifically recarrays? My main application involves processing series 
of large recarrays (say 1000 recarrays, each with 5M rows having 50 
fields). Existing options meet some but not all of my requirements.

Requirements
--------------
The basic requirements are:

Mandatory
 - fast
 - suitable for very large arrays (larger than can fit in memory)
 - compressed (to reduce disk space, read data more quickly)
 - seekable (can read subset of data without decompressing everything)
 - can append new data to an existing file
 - able to extract individual fields from a recarray (for when indexing 
or processing needs just a few fields)
Nice to have
 - files can be split without decompressing and recompressing (e.g. 
distribute processing over a grid)
 - encryption, ideally field-level, with encryption occurring after 
compression
 - can store multiple arrays in one physical file (convenience)
 - portable/standard/well documented

Existing options
-----------------
Over the last few years I've tried most of numpy's options for saving 
arrays to disk, including pickles, .npy, .npz, memmap-ed files and HDF5 
(PyTables).

None of these is perfect, although Pytables comes close:
 - .npy - not compressed, need to read whole array into memory
 - .npz - compressed but ZLIB compression is too slow
 - memmap - not compressed
 - PyTables (HDF5 using chunked storage for recarrays with LZO 
compression and shuffle filter; sketched below)
    - can't extract individual fields from a recarray
    - multiple dependencies (HDF5, PyTables+LZO, h5py+LZF)
    - HDF5 is standard but the LZO implementation is specific to PyTables 
(similarly LZF is specific to h5py)
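For reference, the PyTables arrangement above looks roughly like this. 
It is a minimal sketch assuming the current PyTables API; the file name, 
cut-down dtype and row counts are just placeholders:

    import numpy as np
    import tables

    # A cut-down record dtype standing in for the real ~50-field one
    dtype = np.dtype([('key', 'S8'), ('amount', 'f8'), ('count', 'i4')])

    filters = tables.Filters(complevel=1, complib='lzo', shuffle=True)
    with tables.open_file('data.h5', 'w') as h5:
        table = h5.create_table('/', 'recs', description=dtype,
                                filters=filters, expectedrows=5000000)
        table.append(np.zeros(1000, dtype=dtype))  # append pages as they arrive

    # Reading a slice only decompresses the chunks it touches, but whole
    # records come back even when only one field is wanted:
    with tables.open_file('data.h5', 'r') as h5:
        amounts = h5.root.recs.read(start=0, stop=100000)['amount']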

Are there any other options?


Thoughts about a new format
--------------------------------
It seems that numpy could benefit from a new storage format. My first 
thoughts involve:

 - Use chunked format - split big arrays into pages of consecutive rows, 
compressed separately
 - Get good compression ratios by shuffling data before compressing 
(byte 1 of all rows, then byte 2 of all rows, ...) - see the shuffle 
sketch after this list
 - Get efficient access to individual fields in recarrays by compressing 
each recarray field's data separately (shuffling has the nice side effect 
of separating each field's data)
 - Make it fast to compress and decompress by using LZO
 - Store pages of rows (and compressed field data within a page) using a 
numpy variation of the IFF chunked format (e.g. as used by the DjVu scanned 
document format, version 3). For example, a FORM chunk for the whole file, 
a DTYP chunk for dtype info, a DIRM chunk for the directory of pages 
holding rows, and an NPAG chunk for a page - see the chunk-layout sketch 
after this list
 - The IFF structure of named chunk types allows the format to be extended 
(compressors other than LZO, encryption, links to remote data chunks, etc.)
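To make the shuffle and per-field ideas concrete, here is a rough sketch 
of compressing one page of a recarray field by field. zlib stands in for 
LZO (which needs a third-party binding such as python-lzo), and the 
helper names are just illustrative:

    import zlib
    import numpy as np

    def compress_page(page):
        """Compress one page of a recarray, one field at a time.

        Shuffling (the first byte of every row, then the second, ...)
        groups similar bytes together, which usually compresses much
        better.  Returns {field_name: compressed_bytes}.
        """
        out = {}
        for name in page.dtype.names:
            col = np.ascontiguousarray(page[name])
            itemsize = col.dtype.itemsize
            shuffled = col.view(np.uint8).reshape(len(col), itemsize).T.copy()
            out[name] = zlib.compress(shuffled.tobytes(), 1)  # LZO in practice
        return out

    def read_field(compressed_page, name, dtype, nrows):
        """Decompress and un-shuffle a single field from one page."""
        raw = zlib.decompress(compressed_page[name])
        itemsize = np.dtype(dtype).itemsize
        shuffled = np.frombuffer(raw, dtype=np.uint8).reshape(itemsize, nrows)
        return shuffled.T.copy().view(dtype).reshape(nrows)

    # e.g.
    page = np.zeros(1000, dtype=[('key', 'S8'), ('amount', 'f8')])
    amounts = read_field(compress_page(page), 'amount', 'f8', len(page))

Decompressing only the fields that are asked for is what gives the cheap 
per-field access that PyTables currently doesn't provide for recarrays.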
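And a rough sketch of the IFF-style container. The 4-byte chunk ID plus 
big-endian 4-byte length (padded to an even boundary) follows the general 
IFF convention; the FORM/DTYP/DIRM/NPAG IDs are the ones proposed above, 
but everything else here is only illustrative:

    import struct

    def write_chunk(f, chunk_id, payload):
        """Write one chunk: 4-byte ID, 4-byte big-endian length, payload."""
        assert len(chunk_id) == 4
        f.write(chunk_id)
        f.write(struct.pack('>I', len(payload)))
        f.write(payload)
        if len(payload) % 2:
            f.write(b'\x00')    # IFF pads odd payloads to an even boundary

    def read_chunk(f):
        """Read the next chunk; returns (chunk_id, payload), or None at EOF."""
        header = f.read(8)
        if len(header) < 8:
            return None
        length, = struct.unpack('>I', header[4:])
        payload = f.read(length)
        if length % 2:
            f.read(1)           # skip the pad byte
        return header[:4], payload

    # e.g. a DTYP chunk holding the dtype description, then one page of rows:
    # write_chunk(f, b'DTYP', dtype_description_bytes)
    # write_chunk(f, b'NPAG', compressed_page_bytes)

A DIRM chunk would then just be a table of (first row, file offset) pairs 
so a reader can seek straight to the pages it needs.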

I'd appreciate any comments or suggestions before I start coding.

References
-----------
DjVu format - http://djvu.org/resources/
DjVu v3 format - http://djvu.org/docs/DjVu3Spec.djvu


Stephen

P.S. Maybe this will be too much work, and I'd be better off sticking 
with PyTables.....

