[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Torgil Svensson torgil.svensson@gmail....
Mon Jul 9 17:46:49 CDT 2007


Elegant solution. Very readable and takes care of row0 nicely.

I want to point out that this is much more efficient than my version
when string representation changes occur randomly or late in the
conversion, but it suffers from a 2*n memory footprint and large block
copying if a string representation change arrives very early in a huge
dataset. I think we can't have the best of both, and Tim's solution is
better in the general case.

Maybe "use one_alt if rownumber < xxx else use other_alt" can
fine-tune performance for some cases, but even then, with many columns,
it's nearly impossible to know.

//Torgil


On 7/9/07, Timothy Hochberg <tim.hochberg@ieee.org> wrote:
>
>
> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> > Thanks for looking into this Torgil! I agree that this is a much more
> > complicated setup. I'll check if there is anything I can do on the data
> > end. Otherwise I'll go with Timothy's suggestion and read in numbers as
> > floats and convert to int later as needed.
>
> Here is a strategy that should allow auto detection without too much in the
> way of inefficiency. The basic idea is to convert till you run into a
> problem, store that data away, and continue the conversion with a new dtype.
> At the end you assemble all the chunks of data you've accumulated into one
> large array. It should be reasonably efficient in terms of both memory and
> speed.
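
A minimal sketch of that convert-until-failure idea, in plain Python and
for a single column only. The names FALLBACKS, convert_in_chunks and
assemble are hypothetical and do not appear in the code further down; the
int, float, str fallback order is likewise an assumption:

```python
# Hypothetical fallback order, narrowest type first (an assumption).
FALLBACKS = [int, float, str]


def convert_in_chunks(values):
    """Convert strings with the narrowest converter that still works,
    closing the current chunk each time that converter fails."""
    chunks = []   # list of (converter, converted_values)
    level = 0     # index into FALLBACKS; only ever increases
    current = []
    for text in values:
        while True:
            try:
                item = FALLBACKS[level](text)
                break
            except ValueError:
                # Store what has been converted so far, then widen the type.
                if current:
                    chunks.append((FALLBACKS[level], current))
                    current = []
                level += 1   # str never raises, so this stops at the last entry
        current.append(item)
    if current:
        chunks.append((FALLBACKS[level], current))
    return chunks


def assemble(chunks):
    """Re-cast every chunk with the final (widest) converter and join them."""
    if not chunks:
        return []
    final = chunks[-1][0]
    return [final(v) for _, chunk in chunks for v in chunk]
```

For example, the values "1", "2", "3.5", "4" give one int chunk and one
float chunk, and assemble() returns them all as floats. Only the rows
seen before a type change are re-cast; nothing is parsed twice.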
>
> The implementation is a little rough, but it should get the idea across.
>
> --
> .  __
> .   |-\
> .
> .  tim.hochberg@ieee.org
>
> ========================================================================
>
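> import numpy as N
>
> def find_formats(items, last):
>     # Guess a (dtype, converter) pair for each field. string_to_dt_cvt is
>     # not defined in this message; it is assumed to return such a pair.
>     formats = []
>     for i, x in enumerate(items):
>         dt, cvt = string_to_dt_cvt(x)
>         if last is not None:
>             # Same (dtype, converter) order as the tuples appended below.
>             last_dt, last_cvt = last[i]
>             if last_cvt is float and cvt is int:
>                 # Once a column has been float, keep it float.
>                 dt, cvt = last_dt, float
>         formats.append((dt, cvt))
>     return formats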
>
> class LoadInfo(object):
>     def __init__(self, row0):
>         self.done = False
>         self.lastcols = None
>         self.row0 = row0
>
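> def data_iterator(lines, converters, delim, info):
>     # The first row is converted outside the try block: its formats were
>     # just detected from it, so it is expected to convert cleanly.
>     yield tuple(f(x) for f, x in zip(converters, info.row0.split(delim)))
>     try:
>         for row in lines:
>             yield tuple(f(x) for f, x in zip(converters, row.split(delim)))
>     except ValueError:
>         # A field no longer fits its converter (assuming converters fail
>         # with ValueError, as int and float do). Keep the offending row so
>         # the caller can re-detect formats and start a new chunk from it.
>         info.row0 = row
>     else:
>         info.done = True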
>
> def load2(fname, delim=',', has_varnm=True, prn_report=True):
>     """
>     Load delimited text data from a file. Returns a structured (record)
>     array.
>     """
>     f = open(fname, 'r')
>
>     if has_varnm:
>         varnames = [i.strip() for i in next(f).split(delim)]
>     else:
>         varnames = None
>
>     info = LoadInfo(next(f))
>     chunks = []
>     chunks = []
>
>     while not info.done:
>         row0 = info.row0.split(delim)
>         formats = find_formats(row0, info.lastcols)
>         # Remember the formats so the next chunk can keep float columns float.
>         info.lastcols = formats
>         if varnames is None:
>             varnames = ['col%s' % (i + 1) for i, _ in enumerate(formats)]
>         descr = []
>         conversion_functions = []
>         for name, (dtype, cvt_fn) in zip(varnames, formats):
>             descr.append((name, dtype))
>             conversion_functions.append(cvt_fn)
>
>         chunks.append(N.fromiter(
>             data_iterator(f, conversion_functions, delim, info), descr))
>
>     if len(chunks) > 1:
>         # Copy every chunk into one array that uses the final dtype.
>         n = sum(len(x) for x in chunks)
>         data = N.zeros([n], chunks[-1].dtype)
>         offset = 0
>         for x in chunks:
>             delta = len(x)
>             data[offset:offset+delta] = x
>             offset += delta
>     else:
>         [data] = chunks
>
>     # load report
>     if prn_report:
>         print("##########################################\n")
>         print("Loaded file: %s\n" % fname)
>         print("Nr obs: %s\n" % data.shape[0])
>         print("Variables and datatypes:\n")
>         for i in data.dtype.descr:
>             print("Varname: %s, Type: %s, Sample: %s" % (
>                 i[0], i[1], str(data[i[0]][0:3])))
>             print("\n##########################################\n")
>
>     return data
>
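
The code above calls string_to_dt_cvt without defining it. A minimal
sketch of what such a helper might look like, assuming it returns a
(dtype string, converter) pair in the order find_formats unpacks it.
The int, float, string order and the dtype codes are assumptions, not
something stated in the original message:

```python
def string_to_dt_cvt(text):
    """Guess a (dtype, converter) pair for one field of text.

    Assumed behaviour: try int, then float, and otherwise fall back to a
    fixed-width string dtype sized to this one sample.
    """
    text = text.strip()
    for dtype, converter in (("i8", int), ("f8", float)):
        try:
            converter(text)
        except ValueError:
            continue
        return dtype, converter
    # Width comes from the sample alone; longer values seen later would
    # not fit, so a real helper might choose a more generous width.
    return "S%d" % max(len(text), 1), str
```

Because the string width is guessed from a single value, this is only a
starting point; how string columns should grow is left open here.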

