[Numpy-discussion] How to make "lazy" derived arrays in a recarray view of a memmap on large files
Kim Hansen
slaunger@gmail....
Mon Jan 19 04:23:01 CST 2009
Hi Yakov,
Thank you for your kind advice. I ended up doing something simpler and
less arcane.
I first read the baconandeggs interleaved data from a memmap, create a
new writeable memmap based on an unpacked dtype, where bacon and eggs
are in two different '<u1' fields, and then I simply convert the
interleaved baconandeggs data once and for all into the unpacked
(bacon, eggs) data structure in another file. If I do the array
conversions as one-liners I get large transient memory use, as expected,
but the memory usage drops again after the few seconds it takes to
convert the data. And if I have really large
arrays, I have the option of doing the copying in chunks of, say,
200000 elements using slices. In that way I end up having a standard
memmapped recarray, where all fields are easily accessible in a
transparent manner, without having to implement and maintain the
container-type logic. So I have transformed it into a simple data
transformation problem, where the conversion code is centralized and
easily maintained if I want to add a new field, etc.
Not rocket science, but it works. It also has the advantage that
subsequent access to the unpacked data is efficient and
straightforward, which is good for my application, as I need to use
the data afterwards in many different ways for many different
applications.
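A minimal sketch of that one-off conversion, written in chunks. The
function name unpack_index, the dtype names PACKED and UNPACKED, and the
file paths passed in are made up for illustration; only the field layout
and the 200000-element chunk size come from the description above.

```python
import numpy as np

# Packed layout of the original index file: bacon is the high nibble and
# eggs the low nibble of the first byte.
PACKED = np.dtype([("baconandeggs", "<u1"), ("spam", "<u1"), ("parrots", "<u1")])
# Unpacked layout: every field gets its own byte.
UNPACKED = np.dtype([("bacon", "<u1"), ("eggs", "<u1"),
                     ("spam", "<u1"), ("parrots", "<u1")])


def unpack_index(src_path, dst_path, chunk=200000):
    """Convert a packed index file into an unpacked one, chunk by chunk."""
    src = np.memmap(src_path, dtype=PACKED, mode="r")
    # mode="w+" creates (or overwrites) the destination file.
    dst = np.memmap(dst_path, dtype=UNPACKED, mode="w+", shape=src.shape)
    for start in range(0, len(src), chunk):
        stop = min(start + chunk, len(src))
        block = src[start:stop]
        # Only one chunk of temporaries is alive at a time.
        dst["bacon"][start:stop] = block["baconandeggs"] >> 4
        dst["eggs"][start:stop] = block["baconandeggs"] & 0x0F
        dst["spam"][start:stop] = block["spam"]
        dst["parrots"][start:stop] = block["parrots"]
    dst.flush()
    # All fields are now plain attributes of a memmapped recarray.
    return dst.view(np.recarray)
```

After the conversion, something like unpack_index("packed.idx",
"unpacked.idx").bacon[1000000:1001000] only touches the pages backing
that slice.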
Cheers,
Kim
2009/1/19 Yakov Keselman <yakov.keselman@gmail.com>:
> The problem, as I understand it, is this:
>
> you have a large array and you want to define objects that (1) behave
> like arrays; (2) are derived from the large array (can be computed
> from it); (3) should not take much space if only small portions of the
> large array are ever referenced.
>
> A simple solution that would satisfy (2) and (3) would be to define a
> method that uses indices and returns real arrays, so that instead of
> saying "besp.bacon[1000000:1001000]" you'd say
> "besp.bacon(1000000, 1001000)". You'd do >>4 inside the method only on
> the portion of the original array that you're interested in.
>
> A solution that would also satisfy (1) would need to implement
> array-like methods on your newly-defined class. You can probably use
> http://docs.python.org/reference/datamodel.html#emulating-container-types
> as your starting point.
>
> If your code is meant only for yourself, I'd suggest going with the
> inconvenient but working method approach. If other people will want to
> use your class in an array-like manner, you'd have to properly define
> all the functionality that people would expect from an array.
>
> Hope this helps.
>
> = Yakov
>
>
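For comparison, the container-type route suggested above could be
sketched roughly as follows. The class name LazyField and the 65536
block size are invented for illustration, and this covers only
len(), indexing/slicing and iteration, not the full array interface.

```python
import numpy as np


class LazyField:
    """Array-like wrapper that applies `func` only to the elements requested."""

    def __init__(self, source, func):
        self._source = source  # e.g. a memmapped uint8 field
        self._func = func      # elementwise transform, e.g. lambda a: a >> 4

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        # Index or slice the raw data first, then transform only that part.
        return self._func(self._source[index])

    def __iter__(self):
        # Walk the data in modest blocks so memory use stays bounded.
        step = 65536
        for start in range(0, len(self), step):
            for value in self[start:start + step]:
                yield value


# Small in-memory stand-in for the memmapped baconandeggs field.
packed = np.array([0x12, 0xAB, 0xF0], dtype="<u1")
bacon = LazyField(packed, lambda a: a >> 4)
eggs = LazyField(packed, lambda a: a & 0x0F)
```

With a real file, packed would be the baconandeggs field of the memmap,
and bacon[1000000:1001000] would shift only those 1000 elements.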
> On 1/16/09, Kim Hansen <slaunger@gmail.com> wrote:
>> Hi numpy forum
>>
>> I need to efficiently handle some large (300 MB) recordlike binary
>> files, where some data fields are less than a byte and thus cannot be
>> mapped in a record dtype immediately.
>>
>> I would like to be able to access these derived arrays in a memory
>> efficient manner but I cannot figure out how to achieve this.
>>
>> My application of the derived arrays would never be to do operations on
>> the entire array, but rather to iterate over some selected elements and do
>> something with them - operations which seem well suited to being done on
>> demand.
>>
>> I wrote a related post yesterday, which I have not received any
>> response to. I am now posting again using another and perhaps
>> clearer example, which I believe describes my problem spot on.
>>
>> from numpy import *
>>
>> # Python.exe memory use here: 8.14 MB
>> desc = dtype([("baconandeggs", "<u1"), ("spam","<u1"), ("parrots","<u1")])
>> index = memmap("g:/id-2008-10-25-17-ver4.idx", dtype = desc,
>> mode="r").view(recarray)
>> # The index file is very large, contains 292 MB of data
>> # Python.exe memory use: 8.16 MB, only 20 kB extra for memmap mapped to
>> # recarray
>>
>> # The following instant operation takes a few secs working on 3*10^8
>> # elements
>> # How can I derive new array in a lazy/ondemand/memmap manner?
>> index.bacon = index.baconandeggs >> 4
>> # python.exe memory use: 595 MB! Not surprising but how to do better??
>>
>> # Another derived array, which is resource demanding
>> index.eggs = index.baconandeggs & 0x0F
>> # python.exe memory usage is now 731 MB!
>>
>> What I'd like to do is implement a class, LazyBaconEggsSpamParrots,
>> which encapsulates the derived arrays
>>
>> such that I could do
>>
>> besp = LazyBaconEggsSpamParrots("baconeggsspamparrots.idx")
>> for b in besp.bacon:  # Iterate lazily
>>     spam(b)
>> # Only derive the 1000 needed elements, don't do all 1000000
>> dosomething(besp.bacon[1000000:1001000])
>>
>> I envision the class would look something like this
>>
>> class LazyBaconEggsSpamParrots(object):
>>
>>     def __init__(self, filename):
>>         desc = dtype([("baconandeggs", "<u1"),
>>                       ("spam","<u1"),
>>                       ("parrots","<u1")])
>>         self._data = memmap(filename, dtype=desc, mode='r').view(recarray)
>>         # Expose the one-to-one data directly
>>         self.spam = self._data.spam
>>         self.parrots = self._data.parrots
>>         # This would work but costs way too much memory
>>         # self.bacon = self._data.baconandeggs >> 4
>>         # self.eggs = self._data.baconandeggs & 0x0F
>>
>>     def __getattr__(self, attr_name):
>>         if attr_name == "bacon":
>>             # return bacon in an on demand manner, but how?
>>             pass
>>         elif attr_name == "eggs":
>>             # return eggs in an on demand manner, but how?
>>             pass
>>         else:
>>             # If the name is not a data attribute treat it as a normal
>>             # non-existing attribute - raise AttributeError
>>             raise AttributeError(attr_name)
>>
>> but how to do the lazy part of it?
>>
>> -- Kim
>> _______________________________________________
>> Numpy-discussion mailing list
>> Numpy-discussion@scipy.org
>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
> --
> Not to laugh, not to lament, not to curse, but to understand. -- Spinoza