[Numpy-discussion] Loading a > GB file into array

David Cournapeau david@ar.media.kyoto-u.ac...
Sat Dec 1 02:26:50 CST 2007


Martin Spacek wrote:
> Kurt Smith wrote:
>  > You might try numpy.memmap -- others have had success with it for
>  > large files (32 bit should be able to handle a 1.3 GB file, AFAIK).
>
> Yeah, I looked into numpy.memmap. Two issues with that. I need to 
> eliminate as much disk access as possible while my app is running. I'm 
> displaying stimuli on a screen at 200Hz, so I have up to 5ms for each 
> movie frame to load before it's too late and it drops a frame. I'm sort 
> of faking a realtime OS on windows by setting the process priority 
> really high. Disk access in the middle of that causes frames to drop. So 
> I need to load the whole file into physical RAM, although it need not be 
> contiguous. memmap doesn't do that, it loads on the fly as you index 
> into the array, which drops frames, so that doesn't work for me.
If you want to do it 'properly', it will be difficult, especially in 
Python, and especially on Windows. This looks very similar to the 
problem of direct-to-disk recording, where you record audio signals 
from the sound card to the hard drive (think of recording a concert). 
The proper design, at least on Linux and Mac OS X, is to have several 
threads (one for the IO, one for any computation you may want to do 
that never blocks on any condition, etc.) and to use special OS 
facilities (FIFO scheduling, locking pages into physical RAM, etc.) as 
well as some special constructs (lock-free ring buffers). This design 
works relatively well for musical applications, where the amount of 
data is of the same order of magnitude as what you are talking about, 
and the latency requirements are of the same order (a few ms).

This may be overkill for your application, though.
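
As a rough illustration of the thread layout described above, here is a 
minimal sketch in current Python. It only shows the structure (one IO 
thread feeding a bounded buffer, one consumer that displays frames). It 
is not a real-time design: queue.Queue uses locks, so it is not a 
lock-free ring buffer, and nothing here sets FIFO scheduling or locks 
pages in RAM. The names reader, play, show and depth are made up for 
the example.

```python
import queue
import threading


def reader(path, frame_bytes, buf, stop):
    """IO thread: read fixed-size frames and hand them to the consumer."""
    with open(path, "rb") as f:
        while not stop.is_set():
            frame = f.read(frame_bytes)
            if len(frame) < frame_bytes:
                break               # end of file (a partial last frame is dropped)
            buf.put(frame)          # blocks while the buffer is full
    buf.put(None)                   # sentinel: no more frames


def play(path, frame_bytes, depth=64, show=lambda frame: None):
    """Consumer: take frames from the buffer and pass each one to show()."""
    buf = queue.Queue(maxsize=depth)    # bounded buffer between the threads
    stop = threading.Event()
    t = threading.Thread(target=reader,
                         args=(path, frame_bytes, buf, stop),
                         daemon=True)
    t.start()
    shown = 0
    try:
        while True:
            frame = buf.get()           # waits only if the reader fell behind
            if frame is None:
                break
            show(frame)
            shown += 1
    finally:
        stop.set()
        t.join(timeout=1.0)
    return shown
```

The depth argument decides how many frames the reader may run ahead of 
the display loop; a deeper buffer tolerates longer disk stalls at the 
cost of more memory.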
>
> The 2nd problem I had with memmap was that I was getting a WindowsError 
> related to memory:
>
>  >>> data = np.memmap(1.3GBfname, dtype=np.uint8, mode='r')
>
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "C:\bin\Python25\Lib\site-packages\numpy\core\memmap.py", line 
> 67, in __new__
>      mm = mmap.mmap(fid.fileno(), bytes, access=acc)
> WindowsError: [Error 8] Not enough storage is available to process this 
> command
>
>
> This was for the same 1.3GB file. This is different from previous memory 
> errors I mentioned. I don't get this on ubuntu. I can memmap a file up 
> to 2GB on ubuntu no problem, but any larger than that and I get this:
>
>  >>> data = np.memmap(2.1GBfname, dtype=np.uint8, mode='r')
>
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
>    File "/usr/lib/python2.5/site-packages/numpy/core/memmap.py", line 
> 67, in __new__
>      mm = mmap.mmap(fid.fileno(), bytes, access=acc)
> OverflowError: cannot fit 'long' into an index-sized integer
>
> The OverflowError is on the bytes argument. If I try doing the mmap.mmap 
> directly in Python, I get the same error. So I guess it's due to me 
> running 32bit ubuntu.
Yes. 32 bits means several things in this context. You have 32 bits for 
the virtual address space, but part of it is reserved for the kernel; 
this is configurable on Linux, and there is a switch for it on 
Windows. By default, Windows splits it in half: 2 GB for the kernel, 
2 GB for user space. On Linux, it depends on the distribution (on 
Ubuntu, judging from the build config, the default seems to be 3 GB 
for user space and 1 GB for kernel space). I think part of the problem 
(the memory error) is related to this, at least on Windows.
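
To put numbers on the splits mentioned above, here is a small sketch. 
The function names are made up for the example, and sizes are assumed 
to be in binary units (1 GB = 2^30 bytes). The check is only an upper 
bound: a mapping also needs one contiguous free range of addresses, 
and the process's own code, libraries, heap and stacks already occupy 
parts of user space, which is a plausible reason why a 1.3 GB mapping 
can fail even under a 2 GB limit.

```python
GIB = 2**30


def user_space_bytes(address_bits=32, kernel_gib=2):
    """Virtual address space left to a process after the kernel's share."""
    return 2**address_bits - kernel_gib * GIB


def could_map(nbytes, address_bits=32, kernel_gib=2):
    """Necessary (not sufficient) condition for mapping nbytes at once."""
    return nbytes < user_space_bytes(address_bits, kernel_gib)


# 2/2 split (the Windows default described above)
print(could_map(int(1.3 * GIB), kernel_gib=2))
# 3/1 split (the Ubuntu default described above)
print(could_map(int(2.1 * GIB), kernel_gib=1))
```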

But the error you get above is easier to understand: an integer is 32 
bits, but since it is signed, you cannot address more than 2^31 
different locations with it. That's why, with the standard (ANSI C 
stdlib) file functions, you cannot access more than 2 GB; you need a 
special API for that. In your case, because the value cannot be 
encoded as a signed 32-bit integer, you get this error, I guess 
('index-sized integer' presumably means a signed integer). But even if 
it succeeded, you would be caught by the problem above: if you only 
have 2 GB of user space for virtual addressing, I don't think you can 
do an mmap with a size larger than that, since the whole mapping is 
done at once. I am not that knowledgeable about operating systems, 
though, so I may be totally wrong on this.
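
The signed-integer limit can be checked with a few lines of arithmetic. 
This is a sketch in current Python with made-up helper names; it 
assumes the sizes in the thread are in binary units (1 GB = 2^30 
bytes), and it uses the pointer size reported by the struct module as 
a stand-in for the width of an 'index-sized integer' on the running 
interpreter.

```python
import struct

GIB = 2**30


def max_signed(bits):
    """Largest value a signed integer of the given width can hold."""
    return 2**(bits - 1) - 1


def fits_index(nbytes, bits):
    """True if a byte count can be stored in a signed index of that width."""
    return 0 <= nbytes <= max_signed(bits)


def native_index_bits():
    """Pointer width of the running interpreter, in bits (32 or 64)."""
    return struct.calcsize("P") * 8


print(max_signed(32))                       # 2147483647, one short of 2 GB
print(fits_index(int(1.3 * GIB), 32))       # the 1.3 GB file fits
print(fits_index(int(2.1 * GIB), 32))       # the 2.1 GB file does not
print(native_index_bits())
```

On a 64-bit interpreter the same check passes for both files, so the 
OverflowError above would not be expected there.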

cheers,

David

