[Numpy-discussion] numpy.percentile multiple arrays

Brett Olsen brett.olsen@gmail....
Tue Jan 24 20:26:28 CST 2012

```
On Tue, Jan 24, 2012 at 6:22 PM, questions anon
<questions.anon@gmail.com> wrote:
> I need some help understanding how to loop through many arrays to calculate
> the 95th percentile.
> I can easily do this by using numpy.concatenate to make one big array and
> then finding the 95th percentile using numpy.percentile but this causes a
> memory error when I want to run this on 100's of netcdf files (see code
> below).
> Any alternative methods will be greatly appreciated.
>
>
> all_TSFC=[]
> for (path, dirs, files) in os.walk(MainFolder):
>     for dir in dirs:
>         print dir
>     path=path+'/'
>     for ncfile in files:
>         if ncfile[-3:]=='.nc':
>             print "dealing with ncfiles:", ncfile
>             ncfile=os.path.join(path,ncfile)
>             ncfile=Dataset(ncfile, 'r+', 'NETCDF4')
>             TSFC=ncfile.variables['T_SFC'][:]
>             ncfile.close()
>             all_TSFC.append(TSFC)
>
> big_array=N.ma.concatenate(all_TSFC)
> Percentile95th=N.percentile(big_array, 95, axis=0)

If the range of your data is known and limited (i.e., you have a
comparatively small number of possible values, each repeated many
times), then you could do this by keeping a running cumulative
distribution function as you go through your files.  For each file,
calculate a cumulative distribution function (at each possible value,
record the fraction of that file's values strictly less than that
value) and note the number of values N in the file.  It's then
straightforward to combine the cumulative distribution functions from
two separate files, weighting each by its number of values:
cumdist_both = (cumdist1 * N1 + cumdist2 * N2) / (N1 + N2)

Then once you've gone through all the files, look for the value where
your cumulative distribution function reaches 0.95.  If your data
isn't structured with repeated values, though, this won't work,
because your cumulative distribution function will become too big to
hold in memory.  In that case, what I would probably do is take an
iterative approach:  approximate the exact function by dropping some
fraction of the possible values (i.e., evaluating it on a coarser
grid), which will bracket the exact percentile you want within a
limited range, and then walk through the files again calculating the
function more exactly within that range, repeating until you have the
value to the desired precision.

~Brett
```
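
The running cumulative-distribution idea from the reply could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the helper names (`file_cumdist`, `combine`, `percentile_from_cumdist`) and the small integer arrays standing in for per-file data are hypothetical, the data is treated as one flat population (not per grid cell as with `axis=0`), every file is assumed non-empty, and masked arrays would first need something like `.compressed()`.

```python
import numpy as np

def file_cumdist(values, possible_values):
    """Return (fraction of values strictly less than each possible value, N)."""
    values = np.sort(np.asarray(values).ravel())
    n = values.size
    # side="left" gives, for each possible value, the count of elements below it
    below = np.searchsorted(values, possible_values, side="left")
    return below / n, n

def combine(cumdist1, n1, cumdist2, n2):
    """Merge two per-file distributions, weighting each by its sample count."""
    return (cumdist1 * n1 + cumdist2 * n2) / (n1 + n2), n1 + n2

def percentile_from_cumdist(cumdist, possible_values, q=0.95):
    """Largest possible value whose strictly-less fraction does not exceed q."""
    idx = np.searchsorted(cumdist, q, side="right") - 1
    return possible_values[max(idx, 0)]

# Hypothetical stand-ins for the values read from two files.
possible_values = np.arange(10)          # known, limited set of data values
file_a = np.array([0, 1, 1, 2, 3])
file_b = np.array([3, 4, 5, 9, 9])

running, n_total = file_cumdist(file_a, possible_values)
for data in [file_b]:                    # in practice: loop over all files
    cumdist, n = file_cumdist(data, possible_values)
    running, n_total = combine(running, n_total, cumdist, n)

p95 = percentile_from_cumdist(running, possible_values, 0.95)
print(p95)
```

Only `running` and `n_total` are kept between files, so memory use depends on the number of possible values rather than on the total amount of data. The lookup rule used here (last value whose strictly-less fraction stays at or below `q`) is one of several reasonable percentile conventions and will not always match the interpolated result of `numpy.percentile`.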