[Numpy-discussion] How to make "lazy" derived arrays in a recarray view of a memmap on large files
Mon Jan 19 03:49:12 CST 2009
The problem, as I understand it, is this:
you have a large array and you want to define objects that (1) behave
like arrays; (2) are derived from the large array (can be computed
from it); (3) should not take much space if only small portions of the
large array are ever referenced.
A simple solution that would satisfy (2) and (3) is to define a
method that takes indices and returns real arrays, so that instead of
saying "besp.bacon[1000000:1001000]" you'd say
"besp.bacon(1000000, 1001000)". You'd apply the >> 4 inside the method,
only on the portion of the original array that you're interested in.
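For example, a minimal sketch of that method-based approach (the class and
file names are placeholders; the dtype is the one from the question below):

```python
import numpy as np

# dtype taken from the question below: three one-byte fields.
desc = np.dtype([("baconandeggs", "<u1"),
                 ("spam", "<u1"),
                 ("parrots", "<u1")])

class BaconEggsSpamParrots(object):
    """Method-based access: derived fields are computed per slice."""
    def __init__(self, filename):
        # The memmap itself costs almost nothing; nothing is read yet.
        self._data = np.memmap(filename, dtype=desc, mode="r")

    def bacon(self, start, stop):
        # Only the [start:stop] slice is read from disk and shifted;
        # the result is a small, ordinary ndarray.
        return self._data["baconandeggs"][start:stop] >> 4

    def eggs(self, start, stop):
        return self._data["baconandeggs"][start:stop] & 0x0F
```

The memory cost is then proportional to the slice you ask for, not to
the 300 MB file.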
A solution that would also satisfy (1) would need to implement
array-like methods on your newly-defined class. You can probably use
an existing array-wrapper class as your starting point.
If your code is meant only for yourself, I'd suggest going with the
inconvenient but working method approach. If other people will want to
use your class in an array-like manner, you'd have to properly define
all the functionality that people would expect from an array.
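As a middle ground, a small wrapper that implements only __getitem__ gives
you the familiar besp.bacon[1000000:1001000] slicing syntax while still
computing just the requested portion. A sketch (not a full ndarray
replacement; the class name is made up):

```python
import numpy as np

class LazyField(object):
    """Array-like view that applies a function to a slice on demand."""
    def __init__(self, source, func):
        self._source = source  # e.g. one field of a memmap
        self._func = func      # e.g. lambda a: a >> 4

    def __getitem__(self, index):
        # Read only the indexed part, then compute the derived values.
        return self._func(self._source[index])

    def __len__(self):
        return len(self._source)

# Usage: bacon = LazyField(data["baconandeggs"], lambda a: a >> 4)
#        bacon[1000000:1001000]  # touches only 1000 elements
```

Anything beyond indexing and len() (ufuncs, reductions, fancy indexing)
would still have to be added by hand.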
Hope this helps.
On 1/16/09, Kim Hansen <email@example.com> wrote:
> Hi numpy forum
> I need to efficiently handle some large (300 MB) record-like binary
> files, where some data fields are less than a byte and thus cannot be
> mapped to a record dtype directly.
> I would like to be able to access these derived arrays in a
> memory-efficient manner but I cannot figure out how to achieve this.
> My application of the derived arrays would never be to operate on
> the entire array, but rather to iterate over some selected elements and do
> something with them - operations which seem well suited for a memmap.
> I wrote a related post yesterday, which I have not received any
> response to. I am now posting again with another and perhaps
> clearer example, which I believe describes my problem spot on:
> from numpy import *
> # Python.exe memory use here: 8.14 MB
> desc = dtype([("baconandeggs", "<u1"), ("spam", "<u1"), ("parrots", "<u1")])
> index = memmap("g:/id-2008-10-25-17-ver4.idx", dtype=desc, mode="r")
> # The index file is very large, contains 292 MB of data
> # Python.exe memory use: 8.16 MB, only 20 kB extra for the memmap mapped to
> # the file
> # The following operation takes a few secs working on 3*10^8 elements
> # How can I derive a new array in a lazy/ondemand/memmap manner?
> index.bacon = index.baconandeggs >> 4
> # python.exe memory use: 595 MB! Not surprising but how to do better??
> # Another derived array, which is resource demanding
> index.eggs = index.baconandeggs & 0x0F
> # python.exe memory usage is now 731 MB!
> What I'd like to do is implement a class, LazyBaconEggsSpamParrots,
> which encapsulates the derived arrays such that I could do
> besp = LazyBaconEggsSpamParrots("baconeggsspamparrots.idx")
> for b in besp.bacon:  # Iterate lazily
>     pass  # Only derive the 1000 needed elements, don't do all 1000000
> I envision the class would look something like this:
> class LazyBaconEggsSpamParrots(object):
>     def __init__(self, filename):
>         desc = dtype([("baconandeggs", "<u1"),
>                       ("spam", "<u1"),
>                       ("parrots", "<u1")])
>         self._data = memmap(filename, dtype=desc, mode='r').view(recarray)
>         # Expose the one-to-one data directly
>         self.spam = self._data.spam
>         self.parrots = self._data.parrots
>         # This would work but costs way too much memory
>         # self.bacon = self._data.baconandeggs >> 4
>         # self.eggs = self._data.baconandeggs & 0x0F
>     def __getattr__(self, attr_name):
>         if attr_name == "bacon":
>             pass  # return bacon in an on-demand manner, but how?
>         elif attr_name == "eggs":
>             pass  # return eggs in an on-demand manner, but how?
>         # If the name is not a data attribute treat it as a normal
>         # non-existing attribute - raise AttributeError
>         raise AttributeError(attr_name)
> but how to do the lazy part of it?
> -- Kim
Not to laugh, not to lament, not to curse, but to understand. -- Spinoza