[Numpy-discussion] Not enough storage for memmap on 32 bit WinXP for accumulated file size above approx. 1 GB

Kim Hansen slaunger@gmail....
Mon Jul 27 05:11:42 CDT 2009


2009/7/24 David Cournapeau <david@ar.media.kyoto-u.ac.jp>:
>
> Well, the question has popped up a few times already, so I guess this
> is not so obvious :) A 32-bit architecture fundamentally means that a
> pointer is 32 bits, so you can only address 2^32 different memory
> locations. The 2 GB limit instead of 4 GB is a consequence of how the
> Windows and Linux kernels work. You can mmap a file which is bigger than
> 4 GB (as you can allocate more than 4 GB, at least in theory, on a
> 32-bit system), but you cannot 'see' more than 4 GB at the same time
> because the pointer is too small.
>
> Raymond Chen gives an example on Windows:
>
> http://blogs.msdn.com/oldnewthing/archive/2004/08/10/211890.aspx
>
> I don't know if it is possible to do so in Python, though.
>
>> The reason it isn't obvious for me is because I can read and
>> manipulate files >200 GB in Python with no problems (yes I process
>> that large files), so I thought why should it not be capable of
>> handling quite large memmaps as well...
>>
>
> Handling large files is no problem on 32 bits: it is just a matter of
> API (and kernel/fs support). You move the file position using a 64-bit
> integer and so on. Handling more than 4 GB of memory at the same time is
> much more difficult. To address more than 4 GB, you would need a
> segmented architecture in your memory handling (with a first address for
> a segment, and a second address for the location within one segment).
>
> cheers,
>
> David

OK, I understand what you are saying. However, in my application it
would be really nice to be able to "typecast" recarrays with
an accumulated size in excess of 2 GB onto files, so that I could have
the convenient slicing notation available for accessing the data.

From my (admittedly ignorant) point of view, it seems like an
implementation detail that there is a problem with some
intermediate memory address space.

My typical use case would be to access and process the large
file-mapped, read-only recarray in chunks of up to 1,000,000 records of
100 bytes each, or for instance to pick every 1000th element of a
specific field. Those are data structures that I can easily hold in RAM
while working on them.
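
For illustration, here is a rough sketch of the chunked access pattern
described above, assuming the file is nothing but fixed-size records
back to back. It maps one window at a time using the offset argument of
numpy.memmap, so each mapping needs only about 100 MB of address space
(1,000,000 records x 100 bytes). The helper name iter_chunks and the
record layout in the usage note are made up for the example.

```python
import os

import numpy as np


def iter_chunks(path, dtype, chunk_records=1_000_000):
    """Yield (first_record_index, window) pairs over a file of records.

    Hypothetical helper: each window is a separate read-only memmap that
    covers at most chunk_records records, so the whole file is never
    mapped at once.
    """
    dtype = np.dtype(dtype)
    n_total = os.path.getsize(path) // dtype.itemsize
    for start in range(0, n_total, chunk_records):
        n = min(chunk_records, n_total - start)
        window = np.memmap(
            path,
            dtype=dtype,
            mode="r",
            offset=start * dtype.itemsize,  # byte offset of the first record
            shape=(n,),
        )
        yield start, window
```

With an assumed 100-byte layout such as
np.dtype([("id", "<i4"), ("payload", "S96")]), one would loop with
"for start, window in iter_chunks(path, dt):" and slice each window as
usual. Dropping the reference to a window before asking for the next one
lets its mapping be released.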

I think it would be cool to have an alternative (possibly read-only)
memmap implementation (filearray?), which is not just a wrapper around
mmap.mmap (with its 32-bit address space limitation), but which
(simply?) operates directly on the files with seek and read. I think
that could be very useful (well, for me at least). In my
specific case, I will probably now proceed and write some poor man's
convenience wrapper methods implementing just the specific features
I need, as I do not have the insight to subclass ndarray myself and
override the needed methods. That way I can go beyond 2 GB while still
keeping memory usage low, but it will not be pretty.
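
To make the idea concrete, below is a minimal sketch of what such a
seek/read wrapper might look like. It is not an ndarray subclass and not
an existing NumPy class: the name FileArray and its methods are
hypothetical. It assumes a file of fixed-size records and supports only
integer indices and slices with a positive step.

```python
import operator
import os

import numpy as np


class FileArray:
    """Hypothetical read-only record reader based on seek and read."""

    def __init__(self, path, dtype):
        self.dtype = np.dtype(dtype)
        self._fh = open(path, "rb")
        self._n = os.path.getsize(path) // self.dtype.itemsize

    def __len__(self):
        return self._n

    def _read(self, start, count):
        # Read `count` consecutive records beginning at record `start`.
        if count <= 0:
            return np.empty(0, dtype=self.dtype)
        self._fh.seek(start * self.dtype.itemsize)
        raw = self._fh.read(count * self.dtype.itemsize)
        return np.frombuffer(raw, dtype=self.dtype)  # read-only array

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(self._n)
            if step < 1:
                raise IndexError("only positive steps in this sketch")
            if step == 1:
                return self._read(start, stop - start)
            # Strided access: one seek and one small read per record.
            picks = range(start, stop, step)
            out = np.empty(len(picks), dtype=self.dtype)
            for i, rec in enumerate(picks):
                out[i] = self._read(rec, 1)[0]
            return out
        index = operator.index(key)
        if index < 0:
            index += self._n
        if not 0 <= index < self._n:
            raise IndexError("record index out of range")
        return self._read(index, 1)[0]

    def close(self):
        self._fh.close()
```

With this, fa[::1000]["some_field"] would pick every 1000th element of a
field (the field name is a placeholder), and fa[a:b] would pull one
contiguous chunk into RAM. Only the requested records are ever held in
memory, at the price of one seek per record for strided access.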

Kim

