[Numpy-discussion] multiprocessing shared arrays and numpy
Francesc Alted
faltet@pytables....
Thu Mar 11 07:26:49 CST 2010
On Thursday 11 March 2010 10:36:42, Gael Varoquaux wrote:
> On Thu, Mar 11, 2010 at 10:04:36AM +0100, Francesc Alted wrote:
> > As far as I know, memmap files (or better, the underlying OS) *use* all
> > available RAM for loading data until RAM is exhausted and then start to
> > use SWAP, so the "memory pressure" is still there. But I may be wrong...
>
> I believe that your above assertion is 'half' right. First I think that
> it is not SWAP that the memapped file uses, but the original disk space,
> thus you avoid running out of SWAP. Second, if you open several times the
> same data without memmapping, I believe that it will be duplicated in
> memory. On the other hand, when you memapping, it is not duplicated, thus
> if you are running several processing jobs on the same data, you save
> memory. I am very much in this case.
Mmh, this is not my experience. During the past month, in a course I was
teaching, I asked the students to compare the memory consumption of numpy.memmap
and tables.Expr (a module for performing out-of-memory computations in PyTables).
The idea was precisely to show that, contrary to tables.Expr, numpy.memmap
computations do take a lot of memory while the data is being accessed.
I'm attaching a slightly modified version of that exercise. In it, one has
to evaluate a polynomial over a certain range. Here is the output of the
script for the numpy.memmap case, on a machine with 8 GB of RAM and 6 GB of swap:
Total size for datasets: 7629.4 MB
Populating x using numpy.memmap with 500000000 points...
Total file sizes: 4000000000 -- (3814.7 MB)
*** Time elapsed populating: 70.982
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using numpy.memmap
Total file sizes: 8000000000 -- (7629.4 MB)
**** Time elapsed computing: 81.727
10.08user 13.37system 2:33.26elapsed 15%CPU (0avgtext+0avgdata 0maxresident)k
7808inputs+15625008outputs (39major+5750196minor)pagefaults 0swaps
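The memmap part of the exercise looks roughly like this (a minimal sketch, not the attached poly.py; file names and the much smaller size are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Scaled-down stand-in for the 500-million-point exercise.
N = 1_000_000

tmpdir = tempfile.mkdtemp()

# Populate x on disk through a memory-mapped file.
x = np.memmap(os.path.join(tmpdir, "x.dat"),
              dtype=np.float64, mode="w+", shape=(N,))
x[:] = np.linspace(-1.0, 1.0, N)
x.flush()

# Evaluate the polynomial into a second memmap.  Each operator in the
# expression materializes a full-size in-core temporary, and the pages of
# x that the expression touches pile up in the OS page cache -- which is
# where the memory pressure comes from.
y = np.memmap(os.path.join(tmpdir, "y.dat"),
              dtype=np.float64, mode="w+", shape=(N,))
y[:] = ((0.25 * x + 0.75) * x - 1.5) * x - 2
y.flush()
```

Note that the full-size temporaries are created by NumPy itself, independently of the memmap: evaluating the whole expression at once needs several arrays of N doubles in core.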
While the computation was going on, I spied on the process with the top
utility, which told me that the total virtual size consumed by the Python
process was 7.9 GB, with no less than 6.7 GB of *resident* memory (!). And this
cannot be just a top malfunction, because I checked that, by the end of
the computation, my machine had started to swap some processes out (i.e. the
working set above was too large for the OS to keep everything in memory).
Now, just for the sake of comparison, I tried running the same script
using tables.Expr. Here is the output:
Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 4000631280 -- (3815.3 MB)
*** Time elapsed populating: 78.817
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 8001261168 -- (7630.6 MB)
**** Time elapsed computing: 155.836
13.11user 18.59system 3:58.61elapsed 13%CPU (0avgtext+0avgdata 0maxresident)k
7842784inputs+15632208outputs (28major+940347minor)pagefaults 0swaps
and top told me that memory consumption was 148 MB of total virtual
size and just 44 MB resident (as expected, because the computation was really
made using an out-of-core algorithm).
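The out-of-core strategy behind tables.Expr can be sketched in plain NumPy (an illustrative re-implementation, not PyTables' actual code, which also relies on numexpr and HDF5 chunking): the expression is evaluated block by block, so resident memory is bounded by the chunk size rather than by the dataset size.

```python
import os
import tempfile

import numpy as np

def eval_poly_chunked(x_file, y_file, n, chunk=1 << 16):
    """Evaluate ((.25*x + .75)*x - 1.5)*x - 2 block by block.

    Only one chunk of the operand (plus its temporaries) is in core at
    any time, so peak memory is O(chunk), not O(n).  Illustrative
    sketch only -- not PyTables' real machinery.
    """
    x = np.memmap(x_file, dtype=np.float64, mode="r", shape=(n,))
    y = np.memmap(y_file, dtype=np.float64, mode="w+", shape=(n,))
    for start in range(0, n, chunk):
        xb = np.asarray(x[start:start + chunk])  # copy one chunk in core
        y[start:start + chunk] = ((0.25 * xb + 0.75) * xb - 1.5) * xb - 2
    y.flush()
    return y

# Small demo (file names and sizes are made up for illustration).
tmpdir = tempfile.mkdtemp()
x_path = os.path.join(tmpdir, "x.dat")
n = 100_000
np.linspace(-1.0, 1.0, n).tofile(x_path)
y = eval_poly_chunked(x_path, os.path.join(tmpdir, "y.dat"), n, chunk=4096)
```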
Interestingly, when using compression (Blosc level 4, in this case), the time
to do the computation with tables.Expr dropped a lot:
Total size for datasets: 7629.4 MB
Populating x using tables.Expr with 500000000 points...
Total file sizes: 1080130765 -- (1030.1 MB)
*** Time elapsed populating: 30.005
Computing: '((.25*x + .75)*x - 1.5)*x - 2' using tables.Expr
Total file sizes: 2415761895 -- (2303.9 MB)
**** Time elapsed computing: 40.048
37.11user 6.98system 1:12.88elapsed 60%CPU (0avgtext+0avgdata 0maxresident)k
45312inputs+4720568outputs (4major+989323minor)pagefaults 0swaps
while memory consumption stays roughly the same as above: 148 MB / 45 MB.
So, in my experience, numpy.memmap really does use that large chunk of memory
(unless my testbed is badly programmed, in which case I'd be grateful if you
could point out what's wrong).
--
Francesc Alted
-------------- next part --------------
A non-text attachment was scrubbed...
Name: poly.py
Type: text/x-python
Size: 4642 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20100311/7a5df7ff/attachment-0001.py