[SciPy-Dev] proper way to test distributions

josef.pktd@gmai...
Mon Jun 14 22:36:05 CDT 2010


On Mon, Jun 14, 2010 at 11:07 PM, Vincent Davis
<vincent@vincentdavis.net> wrote:
> I was reviewing how the tests of distributions were done in scipy with
> the thought of applying the same methods to numpy.random. I have a lot
> to learn here and appreciate your suggestions.
>
> Link to the scipy test
> http://github.com/pv/scipy-work/blob/master/scipy/stats/tests/test_continuous_basic.py
>
> If I understand correctly, the tests create a sample of 2000 from a
> given distribution and then compare stats (mean, var, ...) calculated
> with functions from numpy against those stored in the distribution
> instance's .stats. I am not sure how the mean is calculated within the
> distribution (is it just using the scipy mean?). Anyway, this seems a
> little circular.

The sample size is 1000, not 2000, from a quick look.

The distribution mean, var, skew and kurtosis are theoretical results,
which are compared to the sample moments of the random numbers. This
tests both that the random numbers correspond to the theoretical
distribution and that the theoretical moments are correct.
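
Roughly, the idea is something like this (just a sketch, not the actual
test code; the distribution and the tolerances are only for illustration):

    import numpy as np
    from scipy import stats

    np.random.seed(1234)                  # the test suite now fixes a seed
    x = stats.gamma.rvs(2.5, size=1000)   # sample from the distribution

    # theoretical mean, variance, skew, kurtosis for gamma with shape 2.5
    m, v, s, k = stats.gamma.stats(2.5, moments='mvsk')

    assert abs(x.mean() - m) < 0.2        # sample mean vs. theoretical mean
    assert abs(x.var() - v) < 0.5         # sample variance vs. theoretical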

The scipy.stats.distributions test suite is "heavy": with the full run
(slow tests not skipped) it takes 5-6 minutes on my computer.
The test suite initially didn't use a seed, and I had additional
scripts for fuzz testing to catch the bugs. The test suite was written
for the purpose of bug hunting and could be simplified for regression
tests.

The advantage of the current setup is that I only need to add one line
to test a new set of parameters for a given distribution, or a new
distribution, and I get the full checks that the results are
(statistically) correct.
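
Something like the following sketch (a simplified version of the
distcont-style list in test_continuous_basic.py; the entries and the
tolerance are only examples):

    import numpy as np
    from scipy import stats

    # one (name, args) entry per test case; a new case is one more line
    cases = [
        ('norm',  ()),
        ('gamma', (2.5,)),
        ('beta',  (2.0, 3.0)),
    ]

    np.random.seed(1234)
    for name, args in cases:
        dist = getattr(stats, name)
        x = dist.rvs(*args, size=1000)
        m, v = dist.stats(*args, moments='mv')
        # crude check of the sample mean against the theoretical mean
        assert abs(x.mean() - m) < 5 * np.sqrt(v / 1000.0)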

Another option would be to use "certified" benchmark results for the
distributions, e.g. values computed with R. I checked some of them for
distributions where I needed to look at the details, but doing it
across the board for 90 (plus possibly new) distributions sounds very
painful.
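
For example, a single "certified" spot check might look like this (just
a sketch; the reference number is R's pnorm(1.96), rounded):

    from scipy import stats

    r_pnorm_196 = 0.9750021              # from R: pnorm(1.96), rounded
    assert abs(stats.norm.cdf(1.96) - r_pnorm_196) < 1e-6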

Some tests are still disabled because there are (un)known errors in
skew, kurtosis and entropy for some distributions, and fit doesn't
work with the defaults for all distributions.

One advantage of testing numpy.random in scipy.stats is that
scipy.stats has all the theoretical results available.
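
For example, something along these lines (just a sketch; the pairing of
np.random.gamma with stats.gamma, the seed and the threshold are only
illustrative):

    import numpy as np
    from scipy import stats

    np.random.seed(1234)
    sample = np.random.gamma(shape=2.5, size=1000)   # numpy.random sample

    # Kolmogorov-Smirnov test of the sample against the theoretical cdf
    D, pval = stats.kstest(sample, 'gamma', args=(2.5,))
    # a real test would pick the seed and threshold carefully; this is loose
    assert pval > 0.001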

So currently the tests for scipy.stats.distributions are mostly (3) in
your categorization.

Josef

>
> Maybe I am missing something but here are my thought.
>
> 1) Using seed() and then comparing the actual results (arrays) helps to
> make sure the code is stable but tells you nothing about the quality
> of the distribution.
>
> 2) Using seed() and then calculating the moments (with numpy and
> dist.stats) is not really any different than (1).
>
> 3) Drawing a large sample (possibly using seed()), calculating the
> moments, and comparing them to the theoretical moments seems like the
> best option. But this could be slow.
>
> What is the best way?
> What is desired in numpy?
>
> And a little off topic but isn't numpy.random duplicating scipy or
> scipy duplicating numpy?

>
> Thanks
> Vincent
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>

