[Numpy-discussion] Numarray: Summary (seeding): personal code and manual suggestions on initial seeding in module RNG and RandomArray(2)

Eric Maryniak e.maryniak at pobox.com
Thu Jul 25 06:02:03 CDT 2002


Dear crunchers,

Please see my personal thoughts on the past discussion about initial
seeds some paragraphs down below, where I'd like to list concrete
code and manual enhancements aimed at providing users with a clear
understanding of it's usage (and pitfalls e.g. w/r to cryptographic
applications)...

==> Suggestions for code and manual changes w/r to initial seeding
    (down below)

But first a response to Paul's earlier message:

On Thursday 25 July 2002 08:08, Paul F Dubois wrote:
> I'm not going to change the default seed on RNG. Existing users have the
> right to stability, and not to have things change because someone thinks
> a certain choice among several reasonable ones is better than the one
> previously made.

Well, I wasn't aware of the fact that things were completely set in stone
for Numarray solely for backward compatibilty. It was my impression that
numarray and it's accompanying xx2 packages were also open for redesign.
I agree stability is important, but numarray already breaks with Numeric
in other aspects so why should RNG (RNG2 in numarray?) or other packages
not be? It's more a matter of well documenting changes I think.
Users switching to numarray will already have to take into account some
changes and verify their code.
It's not that I "think a certain choice among several reasonable ones is
better" [although my favorite is still a fixed seed, as in RNG, for reasons
of reproducibility in later re-runs of Monte Carlo's that are not possible
now, because the naive user, using a clock seed, may not have saved the
initial seed with get_seed], but that the different packages, i.c. RNG
(RNG2 to be?) and RandomArray2, should be orthogonal in this respect.
I.e. the same, so 'default always an automagical (clock whatever) random
initial seed _or_ a fixed one'. Orthogonality is a very common and accepted
design principle in computing science and for good reasons (usability).
Users changing from one PRNG to another (and using the default seed) would
otherwise be unwelcomely surprised by a sudden change in behavior of their
program.
I try to give logical arguments and real code examples in this discussion
and fail to see in Paul's reaction where I'm wrong.
By the way: in Python 2.1 alpha 2 seeding changed, too:
"""
- random.py's seed() function is new.  For bit-for-bit compatibility with
  prior releases, use the whseed function instead.  The new seed function
  addresses two problems:  (1) The old function couldn't produce more than
  about 2**24 distinct internal states; the new one about 2**45 (the best
  that can be done in the Wichmann-Hill generator).  (2) The old function
  sometimes produced identical internal states when passed distinct
  integers, and there was no simple way to predict when that would happen;
  the new one guarantees to produce distinct internal states for all
  arguments in [0, 27814431486576L).
"""

> There is the further issue here of RNG being advertised as similar to
> Cray's ranf() and that similarity extends to this default. Not to
> mention that for many purposes the current default is quite useful.

Perhaps I'm mistaken here, but RNG/Lib/__init__.py does
(-1 -> uses fixed internal seed):

    standard_generator = CreateGenerator(-1)

and:
    def ranf():
            "ranf() = a random number from the standard generator."
            return standard_generator.ranf()

And indeed Mixranf in RNG/Src/ranf.c does set them to 0:

    ...
    if(*s < 0){ /* Set default initial value */
        s48[0] = s48[1] = 0;
        Setranf(s48);
        Getranf(s48);

And this code, or I'm missing the point, uses a standard generator
from RNG, which demonstrates the same sequence of initial seeds in
re-runs (note that it does not suffer from the "1-second problem"
as RandomArray2 does, see the Appendix below for a demonstration
of that, because RNG uses milliseconds).
Note that 'ranf()' is listed in chapter 18 in Module RNG as one of
the 'Generator objects':

    $ python
    Python 2.2.1 (#1, Jun 25 2002, 20:45:02) ...
    >>> from numarray import *
    >>> from RNG import *
    >>> for i in range(3):
    ...    standard_generator.ranf()
    ...
    0.58011364857958725
    0.95051273498076583
    0.78637142533060356
    >>>

    $ python
    Python 2.2.1 (#1, Jun 25 2002, 20:45:02) ...
    >>> from numarray import *
    >>> from RNG import *
    >>> for i in range(3):
    ...     standard_generator.ranf()
    ...
    0.58011364857958725
    0.95051273498076583
    0.78637142533060356
    >>>


Ok, now then my own (and possibly biased) personal summary of the
past discussions and concrete code and manual recommendations:

==> Suggestions for code and manual changes w/r to initial seeding

Conclusions:
1. Default initial seeding should be random (and not fixed).
   This is the general consensus and while it may not win the beauty
   contest in purist software engineering circles, it also is the
   default behavior in Python's own Random/WHRandom modules. URL:
       http://web.pydoc.org/2.2/random.html
=> Recommendations:
   - Like Python's random/whrandom module, default arguments to seed()
     should not be 0, but None, and this triggers the default behavior
     which is to use a random initial seed (ideally 'truly' random from
     e.g. /dev/random or otherwise clock or whatever based), because:
     o better usability: users changing from Python's own random to
       numarray's random facilities will find familiar seed() usage
       semantics
     o often 0 itself can be a legal seed (although the MersenneTwister
       does not recommend it)
   - Like RNG provide support for using a built-in fixed seed by
     supplying negative seeds to seed(), rationale:
     o support for reproducible re-runs of Monte Carlo's without
       having to specify ones own initial seed
     o usability: naive users may not know a 'good' seed is, like:
       can it be 0 or must it be >0, what is the maximum, etc.
   - See my suggested code fix for RandomArray2.seed() in the Appendix below.
   - Likewise, in RNG:
     o CreateGenerator (s, ...) should be changed to CreateGenerator (s=None)
       Also note Python's own: def create_generators(num, delta, firstseed=None)
       from random (random.py), url: http://web.pydoc.org/2.2/random.html
     o RNG's code should be changed from testing on 0 to testing on None first
       (which results in using the clock), then on < 0 (use built-in seed),
       and then using the user provided seed (which is thus >= 0, and hence
       can also be 0)
     o 'standard_generator = CreateGenerator(-1)' should be changed to
       'standard_generator = CreateGenerator() and results in using the clock
   - Put some explicit warnings in the numarray manual, that the seeding
     of numarray's packages should _not_ be used in those parts of software
     where unpredictability of seeds is important, such as for example,
     cryptographical software for creating session keys, TCP sequence numbers
     etc. Attacks on crypto software usually center around these issues.
     Ideally, a /dev/random should be used, but with the current system
     clock based implementation, the seeds are not random, because the clock
     does not have deci-nanosecond precision (10**10 ~= 2**32) yet ;-)

Appendix
--------
** 1. "1-second problem" with RandomArray2:

$ python
Python 2.2.1 (#1, Jun 25 2002, 20:45:02) ...
>>> from numarray import *
>>> from RandomArray2 import *
>>> import time
>>> import sys
>>> sys.version
'2.2.1 (#1, Jun 25 2002, 20:45:02) \n[GCC 2.95.3 20010315 (SuSE)]'
>>> numarray.__version__
'0.3.5'
>>> for i in range(3):
...     time.time()
...     RandomArray2.seed()
...     RandomArray2.get_seed()
...     time.sleep(1)
...     print
...
1027591910.9043469
(102759, 1911)

1027591911.901091
(102759, 1912)

1027591912.901088
(102759, 1913)

>>> for i in range(3):
...     time.time()
...     RandomArray2.seed()
...     RandomArray2.get_seed()
...     time.sleep(0.3)
...     print
...
1027591966.260392
(102759, 1967)

1027591966.5510809
(102759, 1967)

1027591966.851079
(102759, 1967)

Note that Python (at least 2.2.1) own random() suffers much less
from this (on my 450 MHz machine, every 10-th millisecond or so
the seed will be different):

$ python
Python 2.2.1 (#1, Jun 25 2002, 20:45:02) ...
>>> from random import *
>>> import time
>>>
>>> for i in range(3):
...     print long(time.time() * 256)
...
263065231349
263065231349
263065231349
>>> for i in range(3):
...     print long(time.time() * 256)
...     time.sleep(.00001)
...
263065240314
263065240315
263065240317

By the way, Python's own random.seed() also suffers from this,
but on a 10th-millisecond level (on my 450 Mhz i586 at least).
For the implementation of seed() see Lib/random.py, basically a
'long(time.time()' is used:

$ python
Python 2.2.1 (#1, Jun 25 2002, 20:45:02) ...
>>> from random import *
>>> import time
>>> for i in range(3):
...     print long(time.time() * 256)
...
263065231349
263065231349
263065231349
>>> for i in range(3):
...     print long(time.time() * 256)
...     time.sleep(.00001)
...
263065240314
263065240315
263065240317

2. Proposed re-implementation of RandomArray2.seed():

def seed(x=None,y=None):
    """seed(x, y), set the seed using the integers x, y:
       x or y is None (or not specified):
          A random seed is used which in the current implementation
          may be based on the system's clock.
          Warning: do not this seed in software where the initial seed may
          not be predictable, such as for example, in cryptographical software
          for creating session keys.
       x < 0 or y < 0:
          Use the module's fixed built-in seed which is the tuple
          (1234567890L, 123456789L) (or whatever)
       x >= 0 and y >= 0
          Use the seeds specified by the user.
          (Note: some random number generators do not recommend using 0)
       Note: based on Python 2.2.1's random.seed(a=None).
       ADAPTED for _2_ seeds as required by ranlib.set_seeds(x,y)
    """
    if (x != None and type (x) != IntType) or
       (y != None and type (y) != IntType) :
        raise ArgumentError, "seed requires integer arguments (or None)."
    if x == None or y == None:
        try:
            # This would be the best, but is problematic under Windows/Mac.
            # To my knowledge there isn't a portable lib_randdevice yet.
            # As GPG, OpenSSH and OpenSSL's code show, getting entropy
            # under Windows is problematic.
            # However, Python 2.2.1's socketmodule does wrap the ssl code.
            import dev_random_device  # uses /dev/random or equivalent
            x = dev_random_device.nextvalue()   # egd.sf.net is a user space
            y = dev_random_device.nextvalue()   # alternative
        except:
            # Use Mixranf() from RNG/Src/ranf.c here or, perhaps better,
            # use Python 2.2.1's code? At least it looks simpler and does not
            # have the platform dependency's and has possibly met wider testing
            # (and why not re-use code? ;-)
            # For Python 2.2.1's random.seed(a=None), see url:
            #     http://web.pydoc.org/2.2/random.html
            # and file Lib/random.py.
            # Do note, however, that on my 450 Mhz machine, the statement
            # 'long(time.time() * 256)' will generate the same values
            # within a tenth of a millisecond (see Appendix #1 for a code
            # example). This can be fixed by doing a time.sleep(0.001).
            # See my #EM# comment.
            # Naturally this code needs to be adapted for ranlib's
            # generator, because this code uses the Wichmann-Hill generator.

---cut: Wichmann-Hill---
    def seed(self, a=None):
        """Initialize internal state from hashable object.

        None or no argument seeds from current time.

        If a is not None or an int or long, hash(a) is used instead.

        If a is an int or long, a is used directly.  Distinct values between
        0 and 27814431486575L inclusive are guaranteed to yield distinct
        internal states (this guarantee is specific to the default
        Wichmann-Hill generator).
        """

        if a is None:
            # Initialize from current time
            import time
            a = long(time.time() * 256)
            #EM# Guarantee unique a's between subsequent call's of seed()
            #EM# by sleeping one millisecond. This should not be harmful,
            #EM# because ordinarily, seed() will only be called once or so
            #EM# in a program.
            time.sleep(0.001)

        if type(a) not in (type(3), type(3L)):
            a = hash(a)

        a, x = divmod(a, 30268)
        a, y = divmod(a, 30306)
        a, z = divmod(a, 30322)
        self._seed = int(x)+1, int(y)+1, int(z)+1
---cut: Wichmann-Hill---
    elif x < 0 or y < 0:
        x = 1234567890L  # or any other suitable 0 - 2**32-1
        y = 123456789L
    ranlib.set_seeds(x,y)

3. Mersenne Twister, another PRNG:



Bye-bye,

Eric
-- 
Eric Maryniak <e.maryniak at pobox.com>
WWW homepage: http://pobox.com/~e.maryniak/
Mobile phone: +31 6 52047532, or (06) 520 475 32 in NL.

In a grocery store, the Real Programmer is the one who insists on running
the cans past the laser checkout scanner himself, because he never could
trust keypunch operators to get it right the first time.




More information about the Numpy-discussion mailing list