[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Timothy Hochberg tim.hochberg@ieee....
Sun Jul 8 16:51:23 CDT 2007


On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>
> Torgil,
>
> The function seems to work well and is slightly faster than your previous
> version (about 1/6th faster).
>
> Yes, I do have columns that start with what look like ints and then turn
> out to be floats. Something like below (col6).
>
>     data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>             ['1','3','1/97','1.12','2.11','0'],
>             ['1','2','3/97','1.21','3.12','0'],
>             ['2','1','2/97','1.12','2.11','0'],
>             ['2','2','4/97','1.33','2.26','1.23'],
>             ['2','2','5/97','1.73','2.42','1.26']]
>
> I think what your function assumes is that the 1st element will be the
> appropriate type. That may not hold if you have missing values or 'mixed
> types'.



Vincent,

Do you need to auto-detect the column types? Things get a lot simpler if you
have some known schema for each file; then you can simply pass that to some
reader function. It's also more robust since there's no way in general to
differentiate a column of integers from a column of floats with no decimal
part.
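
A minimal sketch of the known-schema route, assuming the six-column sample
quoted above. The SCHEMA list and the read_with_schema name are made up for
illustration; they are not part of numpy or of any script in this thread.

```python
import csv
import io

import numpy as np

# Hypothetical schema for the six-column sample quoted above:
# (column name, numpy dtype, converter applied to each raw string).
SCHEMA = [
    ("col1", "i8", int),
    ("col2", "i8", int),
    ("col3", "U4", str),    # dates such as '1/97' are kept as text here
    ("col4", "f8", float),
    ("col5", "f8", float),
    ("col6", "f8", float),  # declared float up front, so a leading '0' is harmless
]


def read_with_schema(fileobj, schema, has_header=True):
    """Read CSV rows into a recarray using a caller-supplied schema."""
    reader = csv.reader(fileobj)
    if has_header:
        next(reader)  # column names come from the schema, not the file
    converters = [conv for _, _, conv in schema]
    # One tuple per row, each field converted by its column's converter.
    rows = [tuple(conv(x) for conv, x in zip(converters, row)) for row in reader]
    dtype = np.dtype([(name, dt) for name, dt, _ in schema])
    return np.array(rows, dtype=dtype).view(np.recarray)


sample = io.StringIO(
    "col1,col2,col3,col4,col5,col6\n"
    "1,3,1/97,1.12,2.11,0\n"
    "2,2,4/97,1.33,2.26,1.23\n"
)
rec = read_with_schema(sample, SCHEMA)
```

Because the types are fixed before any row is read, a column that opens with
'0' and later contains '1.23' never triggers a conversion error.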

If you do need to auto-detect, one approach would be to always read both
int-like and float-like values in as floats. Then, once you have the array,
check each column, and if any has no fractional parts, make a new array
where those columns are integers.
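
Roughly, the float-first idea could look like the sketch below. The function
name floats_to_ints is hypothetical, and int64 as the target type is an
assumption; it also assumes the values are small enough to fit in int64.

```python
import numpy as np


def floats_to_ints(rec):
    """Return a copy of `rec` in which float columns that hold only
    whole numbers are stored as int64 instead."""
    new_descr = []
    for name in rec.dtype.names:
        col = rec[name]
        is_float = np.issubdtype(col.dtype, np.floating)
        # NaN and inf fail isfinite, so columns with missing values stay float.
        if is_float and np.all(np.isfinite(col)) and np.all(col == np.floor(col)):
            new_descr.append((name, np.int64))
        else:
            new_descr.append((name, col.dtype))
    out = np.empty(rec.shape, dtype=new_descr)
    for name in rec.dtype.names:
        out[name] = rec[name]  # whole-number floats convert to int exactly
    return out.view(np.recarray)


# Example: 'x' holds only whole numbers, 'y' does not, 's' is text.
example = np.array(
    [(1.0, 0.0, "a"), (2.0, 1.23, "b")],
    dtype=[("x", "f8"), ("y", "f8"), ("s", "U1")],
).view(np.recarray)
converted = floats_to_ints(example)
```

Note that this cannot tell a true integer column from a float column whose
values all happen to be whole, which is the ambiguity mentioned above.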

 -tim

> Best,
>
> Vincent
>
>
> On 7/8/07 3:31 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
>
> > Hi
> >
> > I stumble on these types of problems from time to time so I'm
> > interested in efficient solutions myself.
> >
> > Do you have a column which starts with something suitable for int on
> > the first row (without decimal separator) but has decimals further
> > down?
> >
> > This will be a little tricky to support. One solution could be to yield
> > StopIteration, calculate new type-conversion-functions and start over
> > iterating over both the old data and the rest of the iterator.
> >
> > It'd be great if you could try the load_gen_iter.py I've attached to
> > my response to Tim.
> >
> > Best Regards,
> >
> > //Torgil
> >
> > On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> >> I am not (yet) very familiar with much of the functionality introduced in
> >> your script Torgil (izip, imap, etc.), but I really appreciate you taking
> >> the time to look at this!
> >>
> >> The program stopped with the following error:
> >>
> >>   File "load_iter.py", line 48, in <genexpr>
> >>     convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
> >> ValueError: invalid literal for int() with base 10: '2174.875'
> >>
> >> A lot of the data I use can have a column with a set of ints (e.g., 0s),
> >> but then the rest of that same column could be floats. I guess finding the
> >> right conversion function is the tricky part. I was thinking about sampling,
> >> say, every 10th observation to test which function to use. Not sure how that
> >> would work, however.
> >>
> >> If I ignore the option of an int (i.e., everything is a float, date, or
> >> string) then your script is about twice as fast as mine!!
> >>
> >> Question: If you do ignore the ints initially, once the recarray is in
> >> memory, would there be a quick way to check if the floats could pass as
> >> ints? This may seem like a backwards approach but it might be 'safer' if
> >> you really want to preserve the ints.
> >>
> >> Thanks again!
> >>
> >> Vincent
> >>
> >>
> >> On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
> >>
> >>> Given that both your script and the mlab version preload the whole
> >>> file before calling the numpy constructor, I'm curious how that compares
> >>> in speed to using numpy's fromiter function on your data. Using fromiter
> >>> should improve on memory usage (~50% ?).
> >>>
> >>> The drawback is for string columns, where we no longer know the
> >>> width of the largest item. I made it fall back to "object" in this
> >>> case.
> >>>
> >>> Attached is a fromiter version of your script. Possible speedups could
> >>> be done by trying different approaches to the "convert_row" function,
> >>> for example using "zip" or "enumerate" instead of "izip".
> >>>
> >>> Best Regards,
> >>>
> >>> //Torgil
> >>>
> >>>
> >>> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> >>>> Thanks for the reference John! csv2rec is about 30% faster than my code on
> >>>> the same data.
> >>>>
> >>>> If I read the code in csv2rec correctly it converts the data as it is being
> >>>> read using the csv module. My setup reads the whole dataset into an
> >>>> array of strings and then converts the columns as appropriate.
> >>>>
> >>>> Best,
> >>>>
> >>>> Vincent
> >>>>
> >>>>
> >>>> On 7/6/07 8:53 PM, "John Hunter" <jdh2358@gmail.com> wrote:
> >>>>
> >>>>> On 7/6/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
> >>>>>> I wrote the attached (small) program to read in a text/csv file with
> >>>>>> different data types and convert it into a recarray without having to
> >>>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
> >>>>>> stuff like that :) The supported types are int, float, dates, and
> >>>>>> strings.
> >>>>>>
> >>>>>> It works pretty well but it is not (yet) as fast as I would like, so I was
> >>>>>> wondering if any of the numpy experts on this list might have some
> >>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so
> >>>>>> speed is important for me.
> >>>>>
> >>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
> >>>>> same.  You may want to compare implementations in case we can
> >>>>> fruitfully cross pollinate them.  In the examples directory, there is an
> >>>>> example script examples/loadrec.py
> >>>>> _______________________________________________
> >>>>> Numpy-discussion mailing list
> >>>>> Numpy-discussion@scipy.org
> >>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
> >>>>>
> >>>>
> >>>>
> >>
> >> --
> >> Vincent R. Nijs
> >> Assistant Professor of Marketing
> >> Kellogg School of Management, Northwestern University
> >> 2001 Sheridan Road, Evanston, IL 60208-2001
> >> Phone: +1-847-491-4574 Fax: +1-847-491-2498
> >> E-mail: v-nijs@kellogg.northwestern.edu
> >> Skype: vincentnijs
> >>
> >>
> >
>
>



-- 
.  __
.   |-\
.
.  tim.hochberg@ieee.org