# [SciPy-user] scipy.io.read_array: NaN in data file

Pierre GM pgmdevlist@gmail....
Wed Mar 11 11:26:46 CDT 2009

```Dharhas,
To find duplicates, you can use the following functions (on SVN r2111).
find_duplicated_dates will give you a dictionary; you can then use the
values to decide what you want to do. remove_duplicated_dates will
strip the series, keeping only the first occurrence of each duplicated date.

def find_duplicated_dates(series):
    """
    Return a dictionary (duplicated dates <> indices) for the input
    series.

    The indices are given as a tuple of ndarrays, a la :meth:`nonzero`.

    Parameters
    ----------
    series : TimeSeries, DateArray
        A valid :class:`TimeSeries` or :class:`DateArray` object.

    Examples
    --------
    >>> series = time_series(np.arange(10),
    ...                      dates=[2000, 2001, 2002, 2003, 2003,
    ...                             2003, 2004, 2005, 2005, 2006],
    ...                      freq='A')
    >>> find_duplicated_dates(series)
    {<A-DEC : 2003>: (array([3, 4, 5]),), <A-DEC : 2005>: (array([7, 8]),)}
    """
    dates = getattr(series, '_dates', series)
    steps = dates.get_steps()
    duplicated_dates = tuple(set(dates[steps == 0]))
    indices = {}
    for d in duplicated_dates:
        indices[d] = (dates == d).nonzero()
    return indices
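
# Aside (not part of the original post): the same duplicate-finding idea
# can be sketched with plain NumPy, without scikits.timeseries. The names
# below (`dates`, `dups`) are illustrative only.
import numpy as np

dates = np.array([2000, 2001, 2002, 2003, 2003,
                  2003, 2004, 2005, 2005, 2006])
unique, counts = np.unique(dates, return_counts=True)
# map each duplicated date to the indices where it occurs
dups = {d: np.nonzero(dates == d)[0] for d in unique[counts > 1]}
# e.g. dups[2003] gives array([3, 4, 5])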

def remove_duplicated_dates(series):
    """
    Remove the entries of `series` corresponding to duplicated dates.

    The series is first sorted in chronological order.
    Only the first occurrence of a date is then kept; the others are
    discarded.

    Parameters
    ----------
    series : TimeSeries
        Time series to process
    """
    dates = getattr(series, '_dates', series)
    if not dates.is_chronological():
        series = series.copy()
        series.sort_chronologically()
        dates = series._dates
    # compute the steps after sorting, so repeats show up as steps of 0
    steps = np.concatenate(([1,], dates.get_steps()))
    return series[steps.nonzero()]
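
# Aside (not part of the original post): a minimal plain-NumPy sketch of
# the same "keep only the first occurrence" step; the names below are
# illustrative, not scikits.timeseries API.
import numpy as np

dates = np.array([2000, 2001, 2002, 2003, 2003,
                  2003, 2004, 2005, 2005, 2006])
values = np.arange(10)
order = np.argsort(dates, kind='stable')    # chronological, stable sort
dates, values = dates[order], values[order]
# keep an entry only if its date differs from the previous one
keep = np.concatenate(([True], dates[1:] != dates[:-1]))
deduped = values[keep]                      # -> [0 1 2 3 6 7 9]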

On Mar 11, 2009, at 9:13 AM, Dharhas Pothina wrote:

>
> In this particular case we know the cause:
>
> It is either :
>
> a) Overlapping files have been appended, i.e., file1 contains data from
> Jan1 to Feb1 and file2 contains data from Jan1 to March1. The
> overlap region has identical data.
>
> b) The data comes from sequential deployments and there is a small
> overlap at the beginning of the second file, i.e., file1 has data from
> Jan1 to Feb1 and file2 contains data from Feb1 to March1. There may
> be a few data points of overlap. These are junk because the equipment
> was set up in the lab and took measurements in the air until it was
> swapped with the installed instrument in the water.
>
> In both these cases it is appropriate to take the first value. In
> the second case we really should be stripping the bad data before
> appending, but this is a work in progress. Right now we are
> developing a semi-automated QA/QC procedure to clean up data before
> posting it on the web. We presently use a mix of awk and shell
> scripts, but I'm trying to convert everything to Python to make it
> easier to use and more maintainable, to get nicer plots than gnuplot,
> and to develop a GUI application to help us do this.
>
> - dharhas
>
>>>> Timmie <timmichelsen@gmx-topmail.de> 3/11/2009 4:35 AM >>>
>> Well, because there's no standard way to do that: when you have
>> duplicated dates, should you take the first  one? The last one ? Take
>> some kind of average of the values ?
> Sometimes, there are inherent faults in the data set. Therefore, an
> automatic treatment may introduce further errors.
> It's only possible when these errors occur somewhat
> systematically.
>
>
>
>
> _______________________________________________
> SciPy-user mailing list
> SciPy-user@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user

```