[SciPy-user] predicting values based on (linear) models

josef.pktd@gmai... josef.pktd@gmai...
Thu Jan 15 10:56:07 CST 2009

On Thu, Jan 15, 2009 at 10:09 AM, Bruce Southey <bsouthey@gmail.com> wrote:
> josef.pktd@gmail.com wrote:
>> On Wed, Jan 14, 2009 at 11:24 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
>>> On Jan 14, 2009, at 10:15 PM, josef.pktd@gmail.com wrote:
>>>> The functions in stats that I tested or rewrote are usually identical
>>>> to around 1e-15, but in some cases R has a more accurate test
>>>> distribution for small samples (option "exact" in R), while in
>>>> scipy.stats we only have the asymptotic distribution.
>>> We could try to reimplement part of it in C. In any case, it might
>>> be worth outputting a warning (or at least being very explicit in the
>>> docs) that the results may not hold for samples smaller than 10-20.
>> I am not a "C" person and never went much beyond Hello World in C.
>> I just checked some of the doc strings, and I usually mention that
>> we use the asymptotic distribution, but there are still some pretty
>> vague statements in some of the doc strings, such as
>> "The p-values are not entirely reliable but are probably reasonable for
>> datasets larger than 500 or so."
> The 'exact' tests are usually Fisher's exact tests
> (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), which are very
> different from asymptotic testing and can get computationally very
> demanding. Also, I do not think that such statements should be part of
> the doc strings.

According to the Wikipedia reference, that is for contingency tables. The
two cases I worked on were the exact two-sided Kolmogorov-Smirnov
distribution, where I found a good approximation, and the exact
distribution of the Spearman correlation coefficient under the null of no
correlation.
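To illustrate the small-sample concern: a minimal sketch (on made-up data, with a hypothetical seed) of the two-sided KS test in scipy.stats, whose reported p-value is based on the asymptotic Kolmogorov distribution and therefore may be inaccurate for very small n.

```python
# Sketch only: kstest's p-value comes from an asymptotic distribution,
# which is the behavior being discussed; the sample here is deliberately tiny.
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)   # fixed seed for reproducibility
x = rng.normal(size=10)          # deliberately small sample

d, p = stats.kstest(x, 'norm')   # two-sided KS test against N(0, 1)
print(d, p)                      # statistic and (asymptotic) p-value
```

For n this small, an exact distribution of the KS statistic (as R computes with its "exact" option) can give a noticeably different p-value than the asymptotic one reported here.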

>>>> Also, not all
>>>> existing functions in scipy.stats are tested (yet).
>>> We should also try to make sure missing data are properly supported
>>> (not always possible) and that the results are consistent between the
>>> masked and non-masked versions.
>> I added a ticket so we don't forget to check this.
>>> IMHO, the readiness to incorporate user feedback is here. The feedback
>>> is not, or at least not as much as we'd like.
>> That depends on the subpackage, some problems in stats have been
>> reported and known for quite some time and the expected lifetime of a
>> ticket can be pretty long. I was looking at different python packages
>> that use statistics, and many of them are reluctant to use scipy while
>> numpy looks very well established. But, I suppose this will improve
>> with time and the user base will increase, especially with the recent
>> improvements in the build/distribution and the documentation.
>> Josef
>> _______________________________________________
>> SciPy-user mailing list
>> SciPy-user@scipy.org
>> http://projects.scipy.org/mailman/listinfo/scipy-user
> There are different reasons for a lack of user base. One of the reasons
> for R is that many, many statistics classes use it.
> Some of the reasons that I do not use scipy for stats (and have not
> looked at this in some time) included:
> 1) The difficulty of installation, which is considerably better now.
> 2) Lack of support for missing values, as virtually everything that I
> have worked with involves missing values at some stage.
> 3) Lack of a suitable statistical modeling interface where you can
> specify the model to be fit without having to create each individual
> array. The approach must work for a range of scenarios.

With 2 and 3 I have little experience.
Missing observations I usually remove or clean in the initial data
preparation. mstats provides functions for masked arrays, but stats
mostly assumes there are no missing values. What would be the generic
treatment for missing observations: dropping all observations that
contain NaNs, or converting them to masked arrays and expanding the
functions that can handle those?
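A minimal sketch of the two treatments just mentioned, on made-up data: (1) dropping observations that contain NaN, and (2) converting to a masked array, which numpy.ma (and the mstats functions) then handle.

```python
# Sketch only: toy data with missing values encoded as NaN.
import numpy as np
import numpy.ma as ma

x = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Treatment (1): drop the observations containing NaN.
dropped = x[~np.isnan(x)]
print(dropped.mean())            # mean over the remaining observations

# Treatment (2): mask the NaNs instead of dropping them;
# masked-array operations ignore masked entries but preserve alignment.
masked = ma.masked_invalid(x)
print(masked.mean())
```

Both give the same statistic here; the difference is that the masked array keeps the original shape, which matters when several variables must stay aligned observation by observation.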

Jonathan Taylor included a formula framework in stats.models similar
to R's, but I haven't looked very closely at it. I haven't learned much
of R's syntax, and I usually prefer to build my own arrays (with some
exceptions, such as polynomials) rather than hide them behind a mini
model language. For both stats.models and the interface for general
stats functions, feedback would be very much appreciated.
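A hedged sketch of what "building my own arrays" means in practice, on made-up data: constructing the design matrix by hand for a quadratic fit, which is the kind of thing a formula framework (R's, or the one in stats.models) would generate from a specification like y ~ x + x^2.

```python
# Sketch only: toy noiseless data, design matrix built explicitly.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x**2            # true coefficients: 1.0, 2.0, 0.5

# Design matrix for y ~ 1 + x + x**2: intercept, linear, quadratic columns.
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares via lstsq; recovers the coefficients exactly here.
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)
```

A formula interface hides exactly this column-building step, which is convenient for models with many terms but less transparent than writing the arrays out.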

