[SciPy-User] Minimum points for descriptive statistics?
Bruce Southey
bsouthey@gmail....
Mon Sep 12 10:56:43 CDT 2011
On 09/10/2011 10:40 PM, Mark Livingstone wrote:
> Hi Guys,
>
> I am slowly bringing up to date the SalStat statistics program at
> http://sourceforge.net/projects/salstat/ which uses Numpy to hold its
> data, and to do some of the statistical calculations.
No clue about your program and what you do.
>
> I have two questions which I would like to solicit statistical points
> of view on.
>
> In the GUI, I have a wxPython grid where, as you would expect you put
> a series into each column and stats are then able to be calculated.
>
> (a) What I am wondering is what is the minimum number of data points
> you would feel should be present to perform the standard 5 number
> statistics? I guess that technically if you had two points, you could
> interpolate the median, then Q1 & Q3 but this seems doubtful to me? 3
> numbers would seem a more solid proposal? Maybe we need an "Are you
> sure?" message box! ;-)
You should assume that the user knows what they want. Often a user wants
statistics on multiple variables, so stopping everything just for one
variable is stupid. Also, apps often give more than one value by default,
so the user does not care if the kurtosis is 'doubtful' because they only
wanted the sum or number of observations.
However, there is a computational restriction depending on how you compute
higher-order moments (usually kurtosis requires more than 3 observations).
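
As a rough sketch of that idea: compute every statistic that the data
supports and report NaN for the ones it does not, rather than refusing the
whole column. The names MIN_OBS, sample_excess_kurtosis and describe are
hypothetical (they are not part of SalStat or NumPy), and the thresholds
assume the bias-corrected sample kurtosis, which needs at least 4
observations.

```python
import numpy as np

# Assumed minimum observation counts per statistic (illustrative only).
MIN_OBS = {"n": 0, "sum": 0, "mean": 1, "median": 1, "variance": 2, "kurtosis": 4}


def sample_excess_kurtosis(x):
    """Bias-corrected sample excess kurtosis; its formula divides by
    (n - 2) * (n - 3), so it is undefined for fewer than 4 observations."""
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    if m2 == 0:
        # Constant data: kurtosis is undefined.
        return float("nan")
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


def describe(column):
    """Return each statistic, or NaN where there are too few observations."""
    x = np.asarray(column, dtype=float)
    funcs = {
        "n": lambda a: float(a.size),
        "sum": np.sum,
        "mean": np.mean,
        "median": np.median,
        "variance": lambda a: np.var(a, ddof=1),
        "kurtosis": sample_excess_kurtosis,
    }
    out = {}
    for name, func in funcs.items():
        if x.size >= MIN_OBS[name]:
            out[name] = float(func(x))
        else:
            out[name] = float("nan")
    return out
```

With two points, describe([1.0, 2.0]) still returns the count, sum, mean,
median and variance, and only the kurtosis comes back as NaN.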
>
> (b) Is there any standard way that you deal with missing values (empty
> cells) in the data?
Two options for statistical operations (including tests): remove/exclude
the missing values or keep them.
Typically missing values are excluded, but that is often easier said than
done - masking or deleting can work.
If you keep missing values then the result of any operation involving a
missing value is also missing. You might want to do that when not all
'columns' contain missing values.
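
A minimal sketch of the two options, assuming that empty grid cells are
stored as NaN in a float array (that representation is my assumption, not
something SalStat is known to do):

```python
import numpy as np

# One column with an empty cell, represented here as NaN (an assumption).
col = np.array([2.0, np.nan, 4.0, 6.0])

# Option 1: keep the missing value; the result is then missing as well.
propagated_mean = np.mean(col)

# Option 2a: exclude by masking; the original positions are preserved.
masked = np.ma.masked_invalid(col)
masked_mean = masked.mean()
n_valid = masked.count()  # number of non-missing observations

# Option 2b: exclude by deleting; positions are lost, which matters when
# rows must stay aligned across columns (e.g. paired tests).
deleted = col[~np.isnan(col)]
deleted_mean = deleted.mean()
```

Masking keeps the column the same length, so it is usually the easier
choice when several columns have to stay lined up row by row.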
> Given that you can tick boxes to have a number of descriptive and
> other tests performed on a column, or between columns of data, it
> seems to me that different tests will have different ways to deal with
> missing data? It is not like you can just stick in some default value!
Actually you can put in a 'default value' (zero is good) provided that
you adjust your counts accordingly. Alternatively you get into multiple
imputation.
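
To illustrate the zero 'default value' with adjusted counts: zeros do not
change a sum or a sum of squares, so the results are correct as long as you
divide by the number of non-missing observations. This is only a sketch,
again assuming missing cells are stored as NaN; the sum-of-squares variance
formula is used for illustration and is not the most numerically stable
choice.

```python
import numpy as np

col = np.array([2.0, np.nan, 4.0, 6.0])

missing = np.isnan(col)
filled = np.where(missing, 0.0, col)   # zero as the default value
n_valid = np.count_nonzero(~missing)   # adjusted count, not col.size

total = filled.sum()
mean = total / n_valid

# Sum of squares is also unaffected by the zeros.
sum_sq = (filled ** 2).sum()
variance = (sum_sq - n_valid * mean ** 2) / (n_valid - 1)
```

Dividing by col.size instead of n_valid would silently treat the empty cell
as a real observation of zero.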
> Thanks in advance for any help you can suggest :-D
>
> Regards,
>
> MarkL
>
Take a very long look at Mark's NA mask work in numpy but note that
certain operations are not yet implemented:
http://mail.scipy.org/pipermail/numpy-discussion/2011-August/058103.html
It will provide similar functionality to how R handles missing values.
Bruce