[SciPy-User] Scipy's probplot compared to R's qqplot
Wed Mar 3 14:08:00 CST 2010
On Wed, Mar 3, 2010 at 2:49 PM, <PHobson@geosyntec.com> wrote:
>> On Wed, Mar 3, 2010 at 2:09 PM, <PHobson@geosyntec.com> wrote:
>> > Hey folks,
>> > I've taken more of an interest in statistics and Scipy lately and
>> decided to compare the scipy.stats.probplot() function to R's qqplot().
>> For a given dataset, the results are slightly different.
>> > Here's a link to the script I wrote to do the comparison.
>> > http://dpaste.com/167464/
>> > Basically, it does the following:
>> > -Uses numpy to generate some fake, normally distributed data
>> > -Uses both R and Scipy to compute the values needed for
>> quantile/probability plot
>> > -Computes linear regressions on the quantile data with both R and Scipy
>> > -prints some output to compare the two
>> > My initial conclusions:
>> > 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and
>> intercept of a linear model. (good)
>> > 2) R and Scipy compute the quantiles of a dataset in slightly different
>> manners (??)
>> > Any clue as to why the discrepancy in #2 occurs? Would you consider it
>> a big deal?
>> From: firstname.lastname@example.org [mailto:email@example.com]
>> On Behalf Of firstname.lastname@example.org
>> I would consider any significant deviation a big deal, unless we know
>> that there are differences in the definitions or underlying
>> I'm not sure what's going on since I never looked at the details of
>> probplot. However, when I plot the quantiles
>> >>> plt.plot(np.sort(qR))
>> >>> plt.plot(qS)
>> >>> plt.show()
>> then the graph looks almost the same except for the first and last point.
> Yes. When I plotted them, I could not visually distinguish them (see attached). I forgot to mention that.
>> differs in the second decimal, except for first and last observation.
>> My guess would be that there are some differences for example in the
>> continuity correction, or similar.
>> The boundary points, however, look suspicious.
> Thanks for looking further into this. When I saw that the slopes and intercepts were different, I immediately inspected just the max and min values (laziness, sorry). If I find some time next week, I'll dig around in the source and see if I can't figure out what's happening at those points.
My prime candidate for the 2nd-decimal differences is a difference in
Ui[1:-1] = (i-0.3175)/(N+0.365)
There are several conventions for the empirical cdf; David Huard posted
a list of them, attached to a ticket (?).
There might be another correction for boundary points that is different.
Ui[-1] = 0.5**(1.0/N)
Ui[0] = 1-Ui[-1]
But for graphical inspection, R and scipy look close enough.
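
A minimal sketch of how the two conventions could be compared side by side. It assumes the R side uses the documented defaults of ppoints(), i.e. (i - a)/(n + 1 - 2a) with a = 3/8 for n <= 10 and a = 1/2 otherwise; I have not checked which code path the original script actually hits. The scipy side follows the Ui formulas quoted above. The function names and n = 20 are made up for illustration.

```python
from statistics import NormalDist

def filliben_positions(n):
    """Plotting positions following the Ui formulas quoted above."""
    # Interior points: (i - 0.3175) / (n + 0.365), with i counted from 1.
    u = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    # The first and last points get their own formula.
    u[-1] = 0.5 ** (1.0 / n)
    u[0] = 1.0 - u[-1]
    return u

def r_ppoints(n):
    """Plotting positions per R's documented ppoints() default (assumption)."""
    a = 3.0 / 8.0 if n <= 10 else 0.5
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]

n = 20  # hypothetical sample size, chosen only for illustration
inv_cdf = NormalDist().inv_cdf  # standard normal quantile function

q_scipy = [inv_cdf(p) for p in filliben_positions(n)]
q_r = [inv_cdf(p) for p in r_ppoints(n)]
diffs = [s - r for s, r in zip(q_scipy, q_r)]

# Print the first, a middle, and the last theoretical quantiles side by side.
for k in (0, 1, n // 2, n - 2, n - 1):
    print(k, round(q_scipy[k], 4), round(q_r[k], 4), round(diffs[k], 4))
```

For n = 20 the first and last quantiles differ more between the two conventions than the interior ones do, which would be consistent with what the plot shows, but the sizes depend on n.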
> -Paul H.