[SciPy-user] What is fastest load/save matrix methods?

Dave Kuhlman dkuhlman at cutter.rexx.com
Tue Dec 20 17:16:48 CST 2005


On Tue, Dec 20, 2005 at 04:10:19PM +0100, Francesc Altet wrote:

[snip]

> 
> In summary:
> 
> - PyTables cannot compete in speed for small arrays (<10000 elements).
>   However, the latency for saving/reading these objects is quite low
>   (< 5 ms).
> 
> - For larger arrays, PyTables (and hence, HDF5) always shows better
>   speed, even when dealing with extremely large datasets (which is
>   the field it is supposed to have been designed for).
>   

Francesc -

Thanks very much for the detailed benchmarks and the summary.

But, for those of you thinking of using my_array.tofile() and
scipy.fromfile(), consider the following:

1. There is a warning message:

    >>> help(io.fromfile)
    ...
    WARNING: This function should be used sparingly, as it is not
    a robust method of persistence.  But it can be useful to
    read in simply-formatted or binary data quickly.

2. tofile() and fromfile() do not seem to preserve the shape of
   the array.  Consider:

       IPython profile: scipy
       In [1]:a1 = zeros((100,100))
       In [2]:a1
       Out[2]:NumPy array, format: long
       [[0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        ...,
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]]
       In [3]:a1.tofile('tmp1.data')
       In [4]:a2 = fromfile('tmp1.data')
       In [5]:a1
       Out[5]:NumPy array, format: long
       [[0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        ...,
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]
        [0 0 0 ..., 0 0 0]]
       In [6]:a2
       Out[6]:NumPy array, format: long
       [0 0 0 ..., 0 0 0]
       In [7]:a1.shape
       Out[7]:(100, 100)
       In [8]:a2.shape
       Out[8]:(10000,)

   A two-dimensional array seems to have been collapsed into a
   one-dimensional array (see the sketch just below).

   Or did I not use tofile() and fromfile() correctly?
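
If the collapse is expected behavior, I suppose it is because
tofile() writes only the raw bytes, with no shape or element-type
header, so the shape has to be reapplied by hand after reading.
Something along the following lines ought to do it; note that the
dtype keyword to fromfile() is an assumption on my part that I have
not checked against this scipy_core release:

    # Untested sketch: read the raw bytes back and restore the
    # original shape.  The 'dtype' keyword is an assumption.
    a2 = fromfile('tmp1.data', dtype=a1.dtype)
    a2.shape = (100, 100)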

I modified Francesc's script so that it uses scipy.io.write_array()
and scipy.io.read_array() instead.  The modified script is attached.

Below are results from my machine using this modified script.

Note that I also reduced the number of iterations (and eliminated
the last array size, 10**8, altogether).

  # Sizes for the arrays.
  #array_sizes = (10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7, 10**8)
  array_sizes = (10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7,      )
  # Number of iterations for each array size.
  #niters =      (1000, 1000,  100,   100,    10,     3,     3,     1)
  niters =      (1000, 1000,  100,   100,    10,     1,     1,      )

Writing:

  $ python ../speedcomp_pt_write_array.py scipy_core w
  Python 2.4.2 (#1, Oct 31 2005, 11:22:05)
  [GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu9)]
  Optimization flags: -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
  getNCPUs has_3dnow has_3dnowext has_mmx has_sse is_32bit is_AMD is_singleCPU
  tables.__version__ --> 1.1.1
  scipy_core.__version__--> 0.8.6.1663
  ---
  PyTables,      scipy_core, 10**1 time: 5.816 ms         6.878 KB/s
  write_array(), scipy_core, 10**1 time: 0.548 ms         72.947 KB/s
  ---
  PyTables,      scipy_core, 10**2 time: 5.624 ms         71.119 KB/s
  write_array(), scipy_core, 10**2 time: 1.966 ms         203.467 KB/s
  ---
  PyTables,      scipy_core, 10**3 time: 5.2 ms   769.199 KB/s
  write_array(), scipy_core, 10**3 time: 15.171 ms        263.66 KB/s
  ---
  PyTables,      scipy_core, 10**4 time: 5.892 ms         6788.707 KB/s
  write_array(), scipy_core, 10**4 time: 140.505 ms       284.687 KB/s
  ---
  PyTables,      scipy_core, 10**5 time: 9.298 ms         43021.812 KB/s
  write_array(), scipy_core, 10**5 time: 1400.494 ms      285.614 KB/s
  ---
  PyTables,      scipy_core, 10**6 time: 35.056 ms        114102.778 KB/s
  write_array(), scipy_core, 10**6 time: 14222.73 ms      281.24 KB/s
  ---
  PyTables,      scipy_core, 10**7 time: 296.545 ms       134886.766 KB/s
  write_array(), scipy_core, 10**7 time: 149788.421 ms    267.043 KB/s


And, reading:

  $ python ../speedcomp_pt_write_array.py scipy_core r
  Python 2.4.2 (#1, Oct 31 2005, 11:22:05)
  [GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu9)]
  Optimization flags: -DNDEBUG -g -O3 -Wall -Wstrict-prototypes
  getNCPUs has_3dnow has_3dnowext has_mmx has_sse is_32bit is_AMD is_singleCPU
  tables.__version__ --> 1.1.1
  scipy_core.__version__--> 0.8.6.1663
  ---
  PyTables,     scipy_core, 10**1 time: 5.693 ms  7.026 KB/s
  read_array(), scipy_core, 10**1 time: 2.162 ms  18.498 KB/s
  ---
  PyTables,     scipy_core, 10**2 time: 5.603 ms  71.396 KB/s
  read_array(), scipy_core, 10**2 time: 13.386 ms         29.882 KB/s
  ---
  PyTables,     scipy_core, 10**3 time: 5.748 ms  695.835 KB/s
  read_array(), scipy_core, 10**3 time: 122.681 ms        32.605 KB/s
  ---
  PyTables,     scipy_core, 10**4 time: 5.617 ms  7121.265 KB/s
  read_array(), scipy_core, 10**4 time: 1228.732 ms       32.554 KB/s
  ---
  PyTables,     scipy_core, 10**5 time: 15.066 ms         26550.387 KB/s
  read_array(), scipy_core, 10**5 time: 12816.663 ms      31.209 KB/s
  ---
  PyTables,     scipy_core, 10**6 time: 178.613 ms        22394.793 KB/s
  read_array(), scipy_core, 10**6 time: 137079.957 ms     29.18 KB/s
  ---
  PyTables,     scipy_core, 10**7 time: 1015.1 ms         39404.985 KB/s
  read_array(), scipy_core, 10**7 time: 1456921.132 ms    27.455 KB/s


Looks like scipy.io.write_array() and scipy.io.read_array() are
usable *only* on arrays of moderate size.

And here are the resulting file sizes:

     20 test_1.bin
   4.1K test_1.h5
    290 test_2.bin
   4.4K test_2.h5
   3.8K test_3.bin
   8.0K test_3.h5
    48K test_4.bin
    44K test_4.h5
   576K test_5.bin
   395K test_5.h5
   6.6M test_6.bin
   3.9M test_6.h5
    76M test_7.bin
    39M test_7.h5

Oops.  The .bin files were created by io.write_array(); they are
actually text files.  I should have given them .txt names (the
attached script does use .txt).

I'm only starting to learn PyTables, but I believe that it
provides additional features over and above the tofile/fromfile
and write_array/read_array strategies.  In particular, PyTables
gives you the ability to store and organize multiple arrays in
nested groups (folders) within a single HDF5 file.
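
For example, here is a minimal, untested sketch of what I mean,
using the same PyTables 1.x calls as in the attached script; the
file and group names here are made up purely for illustration:

    import tables
    import numarray

    # Store two arrays inside a group in one HDF5 file.
    h5file = tables.openFile('results.h5', mode='w')
    run1 = h5file.createGroup(h5file.root, 'run1', 'First run')
    h5file.createArray(run1, 'positions', numarray.arange(1000))
    h5file.createArray(run1, 'velocities', numarray.arange(1000))
    h5file.close()

    # Later, read one array back by its path in the hierarchy.
    h5file = tables.openFile('results.h5', mode='r')
    positions = h5file.root.run1.positions[:]
    h5file.close()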

Dave

-- 
Dave Kuhlman
http://www.rexx.com/~dkuhlman
-------------- next part --------------
import tables
import numarray
import scipy.base
import scipy.io
from time import time
import math
from distutils.sysconfig import get_config_var
from scipy.distutils.cpuinfo import cpu

# Sizes for the arrays.
#array_sizes = (10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7, 10**8)
array_sizes = (10, 10**2, 10**3, 10**4, 10**5, 10**6, 10**7,      )
# Number of iterations for each array size.
#niters =      (1000, 1000,  100,   100,    10,     3,     3,     1)
niters =      (1000, 1000,  100,   100,    10,     1,     1,      )

def pytables_write(dset, lsize):
    filename = "test_"+str(lsize)+'.h5'
    h5file = tables.openFile(filename, mode = "w")
    # Write array after converting to a numarray array.
    dset = numarray.asarray(dset)
    h5file.createArray(h5file.root, 'dataset', dset)
    h5file.close()

def pytables_read(numlib, lsize):
    filename = "test_"+str(lsize)+'.h5'
    h5file = tables.openFile(filename, mode = "r")
    # Read array and then convert to a numlib array.
    dset = h5file.root.dataset[:]
    if numlib == "scipy_core":
        dset = scipy.base.asarray(dset)
    h5file.close()
    return dset

def write_array_write(dset, lsize):
    filename = "test_"+str(lsize)+'.txt'
    scipy.io.write_array(filename, dset)

def write_array_read(numlib, lsize):
    filename = "test_"+str(lsize)+'.txt'
    if numlib == "scipy_core":
        return scipy.io.read_array(filename)
    else:
        # Note: numarray.fromfile() expects raw binary data, not the
        # text written by scipy.io.write_array(), so the numarray
        # read path is questionable; the results quoted above were
        # obtained with scipy_core only.
        return numarray.fromfile(filename)

def test_write(numlib):
    j = 0
    for size in array_sizes:
        lsize = int(math.log10(size))
        niter = niters[j]
        j += 1
        if numlib == "scipy_core":
            dset = scipy.base.arange(size)
            itemsize = dset.itemsize
        else:
            dset = numarray.arange(size)
            itemsize = dset.itemsize()
        # Time pytables
        t1=time()
        for i in range(niter):
            pytables_write(dset, lsize)
        print "---"
        print "PyTables,      %s, 10**%s time:" % (numlib, lsize),
        tms = ((time()-t1)/niter)*1000
        print round(tms, 3), "ms",
        print "\t", round((size*itemsize)/tms, 3), "KB/s"
        # Time scipy.io.write_array()
        t1=time()
        for i in range(niter):
            write_array_write(dset, lsize)
        print "write_array(), %s, 10**%s time:" % (numlib, lsize),
        tms = ((time()-t1)/niter)*1000
        print round(tms, 3), "ms",
        print "\t", round((size*itemsize)/tms, 3), "KB/s"

def test_read(numlib):
    j = 0
    for size in array_sizes:
        lsize = int(math.log10(size))
        niter = niters[j]
        j += 1
        # Time pytables
        t1=time()
        for i in range(niter):
            dset = pytables_read(numlib, lsize)
        if numlib == "scipy_core":
            itemsize = dset.itemsize
        else:
            itemsize = dset.itemsize()
        print "---"
        print "PyTables,     %s, 10**%s time:" % (numlib, lsize),
        tms = ((time()-t1)/niter)*1000
        print round(tms, 3), "ms",
        print "\t", round((size*itemsize)/tms, 3), "KB/s"
        # Time scipy.io.read_array()
        t1=time()
        for i in range(niter):
            dset = write_array_read(numlib, lsize)
        print "read_array(), %s, 10**%s time:" % (numlib, lsize),
        tms = ((time()-t1)/niter)*1000
        print round(tms, 3), "ms",
        print "\t", round((size*itemsize)/tms, 3), "KB/s"

# main program
import sys

usage = "%s [scipy_core|numarray] [w|r]" % sys.argv[0]
if len(sys.argv) < 3:
    print "Usage:", usage
    sys.exit(1)

numlib = sys.argv[1]
test_mode = sys.argv[2]

if numlib not in ["numarray", "scipy_core"]:
    print "Only libraries numarray and scipy_core supported"
    sys.exit(1)

if test_mode not in ['r', 'w']:
    print "Only modes 'w'rite and 'r'ead supported"
    sys.exit(1)

print 'Python', sys.version
print 'Optimization flags:', get_config_var('OPT')
for name in dir(cpu):
    if name[0]=='_' and name[1]!='_':
        r = getattr(cpu,name[1:])()
        if r:
            if r!=1:
                print '%s=%s' %(name[1:],r),
            else:
                print name[1:],
print

print "tables.__version__ -->", tables.__version__
if numlib == "scipy_core":
    print "scipy_core.__version__-->", scipy.base.__version__
else:
    print "numarray.__version__-->", numarray.__version__

if test_mode == 'w':
    test_write(numlib)
else:
    import os.path
    if (not os.path.exists("test_1.h5") or
        not os.path.exists("test_1.txt")):
        print "Please, run first the benchmark in 'w'rite mode."
        sys.exit(1)
    test_read(numlib)

