[Numpy-discussion] convert csv file into recarray without pre-specifying dtypes and variable names

Torgil Svensson torgil.svensson@gmail....
Sun Jul 8 19:00:49 CDT 2007


FWIW

>>> n,dt=descr[0]
>>> new_dt=dt.replace('f','i')
>>> descr[0]=(n,new_dt)
>>> data=ra.col1.astype(new_dt)
>>> ra.dtype=N.dtype(descr)
>>> ra.col1=data
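
A possible variant of the snippet above, for what it's worth: assigning to ra.dtype reinterprets the existing bytes in place, so it can only work when the old and new field types have the same item size. The sketch below builds a converted copy with astype instead. The sample array and the names new_descr and ra2 are made up for illustration and are not from the thread.

```python
import numpy as np

# Hypothetical stand-in for the recarray discussed in the thread.
ra = np.rec.fromrecords([(1.0, 1.12), (2.0, 1.33)], names="col1,col2")

# Copy the field descriptions and change the type of the first field only,
# mirroring the 'f' -> 'i' replacement used above (e.g. '<f8' -> '<i8').
new_descr = list(ra.dtype.descr)
name, typestr = new_descr[0]
new_descr[0] = (name, typestr.replace("f", "i"))

# astype casts field by field and returns a new array; view it as a
# recarray again so attribute access (ra2.col1) keeps working.
ra2 = np.asarray(ra).astype(np.dtype(new_descr)).view(np.recarray)
```

This costs one extra copy of the data, which may matter for the 500MB-1GB files mentioned further down.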

//Torgil

On 7/9/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>
>  Tim,
>
>  I do want to auto-detect. Reading numbers in as floats is probably not a
> huge penalty.
>
>  Is there an easy way to change the type of one column in a recarray that
> you know?
>
>  I tried this:
>
>  ra.col1 = ra.col1.astype('i')
>
>  but that didn't seem to work. I assume that means you would have to create
> a new array from the old one with an updated dtype list.
>
>  Thanks,
>
>  Vincent
>
>
>  On 7/8/07 4:51 PM, "Timothy Hochberg" <tim.hochberg@ieee.org> wrote:
>
>
>
>
>  On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>
> Torgil,
>
>  The function seems to work well and is slightly faster than your previous
>  version (about 1/6th faster).
>
>  Yes, I do have columns that start with, what looks like, int's and then
> turn out to be floats. Something like below (col6).
>
>      data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
>              ['1','3','1/97','1.12','2.11','0'],
>              ['1','2','3/97',' 1.21','3.12','0'],
>              ['2','1','2/97','1.12','2.11','0'],
>              ['2','2','4/97','1.33','2.26',' 1.23'],
>              ['2','2','5/97','1.73','2.42','1.26']]
>
>  I think what your function assumes is that the 1st element will be the
>  appropriate type. That may not hold if you have missing values or 'mixed
>  types'.
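
One way to avoid relying on the first element is to scan the whole column and keep the narrowest type that parses every value. This is only a sketch, not the code from load_iter.py: infer_type is a hypothetical helper, dates are simply left as strings, and missing values (empty strings) would push a column to str.

```python
def infer_type(values):
    """Return int, float or str: the narrowest type that parses every value."""
    for candidate in (int, float):
        try:
            for v in values:
                candidate(v.strip())
        except ValueError:
            continue  # at least one value does not fit; try the next type
        else:
            return candidate
    return str

data = [['col1', 'col2', 'col3', 'col4', 'col5', 'col6'],
        ['1', '3', '1/97', '1.12', '2.11', '0'],
        ['1', '2', '3/97', ' 1.21', '3.12', '0'],
        ['2', '1', '2/97', '1.12', '2.11', '0'],
        ['2', '2', '4/97', '1.33', '2.26', ' 1.23'],
        ['2', '2', '5/97', '1.73', '2.42', '1.26']]

header, rows = data[0], data[1:]
columns = zip(*rows)  # transpose rows into columns
types = {name: infer_type(col) for name, col in zip(header, columns)}
```

With the sample data, col6 comes out as float even though its first three rows look like ints. The price is a full pass over each column before any conversion starts.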
>
>
>
>  Vincent,
>
>  Do you need to auto detect the column types? Things get a lot simpler if
> you have some known schema for each file; then you can simply pass that to
> some reader function. It's also more robust since there's no way in general
> to differentiate a column of integers from a column of floats with no
> decimal part.
>
>  If you do need to auto detect, one approach would be to always read both
> int-like stuff and float-like stuff in as floats. Then after you get the
> array check over the various columns and if any have no fractional parts,
> make a new array where those columns are integers.
>
>   -tim
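
The read-as-float-then-check approach described above could look roughly like this. It is a sketch under assumptions: floats_to_ints is a hypothetical helper name, int64 is an arbitrary choice of target type, and columns containing NaN or inf are left as floats.

```python
import numpy as np

def floats_to_ints(arr, int_type=np.int64):
    """Return a copy of a structured array in which every float field
    without fractional parts has been converted to int_type."""
    new_descr = []
    for name in arr.dtype.names:
        field = arr[name]
        is_whole = (field.dtype.kind == "f"
                    and np.all(np.isfinite(field))
                    and np.all(field == np.floor(field)))
        new_descr.append((name, np.dtype(int_type) if is_whole else field.dtype))

    # Build the new array and copy the data over field by field.
    out = np.empty(arr.shape, dtype=new_descr)
    for name in arr.dtype.names:
        out[name] = arr[name]
    return out

# Made-up sample: col1 holds whole numbers, col6 does not.
arr = np.array([(1.0, 0.0), (2.0, 1.23)],
               dtype=[("col1", "f8"), ("col6", "f8")])
converted = floats_to_ints(arr)
```

As noted above, this cannot tell a genuine integer column from a float column that happens to have no decimal part, so a known schema remains the more robust option.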
>
>
> Best,
>
>  Vincent
>
>
>  On 7/8/07 3:31 PM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
>
>  > Hi
>  >
>  > I stumble on these types of problems from time to time so I'm
>  > interested in efficient solutions myself.
>  >
>  > Do you have a column which starts with something suitable for int on
>  > the first row (without decimal separator) but has decimals further
>  > down?
>  >
>  > This will be a little tricky to support. One solution could be to yield
>  > StopIteration, calculate new type-conversion-functions and start over
>  > iterating over both the old data and the rest of the iterator.
>  >
>  > It'd be great if you could try the load_gen_iter.py I've attached to
>  > my response to Tim.
>  >
>  > Best Regards,
>  >
>  > //Torgil
>  >
>  > On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>  >> I am not (yet) very familiar with much of the functionality introduced in
>  >> your script Torgil (izip, imap, etc.), but I really appreciate you taking
>  >> the time to look at this!
>  >>
>  >> The program stopped with the following error:
>  >>
>  >>   File "load_iter.py", line 48, in <genexpr>
>  >>     convert_row=lambda r: tuple(fn(x) for fn,x in izip(conversion_functions,r))
>  >> ValueError: invalid literal for int() with base 10: '2174.875'
>  >>
>  >> A lot of the data I use can have a column with a set of int's (e.g., 0's),
>  >> but then the rest of that same column could be floats. I guess finding the
>  >> right conversion function is the tricky part. I was thinking about sampling
>  >> each, say, 10th obs to test which function to use. Not sure how that would
>  >> work however.
>  >>
>  >> If I ignore the option of an int (i.e., everything is a float, date, or
>  >> string) then your script is about twice as fast as mine!!
>  >>
>  >> Question: If you do ignore the int's initially, once the rec array is in
>  >> memory, would there be a quick way to check if the floats could pass as
>  >> int's? This may seem like a backwards approach but it might be 'safer' if
>  >> you really want to preserve the int's.
>  >>
>  >> Thanks again!
>  >>
>  >> Vincent
>  >>
>  >>
>  >> On 7/8/07 5:52 AM, "Torgil Svensson" <torgil.svensson@gmail.com> wrote:
>  >>
>  >>> Given that both your script and the mlab version preload the whole
>  >>> file before calling the numpy constructor, I'm curious how that compares
>  >>> in speed to using numpy's fromiter function on your data. Using fromiter
>  >>> should improve on memory usage (~50% ?).
>  >>>
>  >>> The drawback is for string columns, where we no longer know the
>  >>> width of the largest item. I made it fall back to "object" in this
>  >>> case.
>  >>>
>  >>> Attached is a fromiter version of your script. Possible speedups could
>  >>> be done by trying different approaches to the "convert_row" function,
>  >>> for example using "zip" or "enumerate" instead of "izip".
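
For readers without the attachment, a minimal sketch of the fromiter idea follows. It is not the attached script: the sample text, the column types, and the names converters and rows are assumptions for illustration, and the conversion functions are taken as known rather than auto-detected.

```python
import csv
import io

import numpy as np

# Made-up sample standing in for a csv file on disk.
text = "col1,col4\n1,1.12\n2,1.33\n"
reader = csv.reader(io.StringIO(text))

names = next(reader)        # header row
converters = (int, float)   # assumed known for this sketch
dtype = np.dtype(list(zip(names, ("i8", "f8"))))

# Convert one row at a time; fromiter consumes the generator directly,
# so the full list of string rows never has to be held in memory.
rows = (tuple(fn(x) for fn, x in zip(converters, row)) for row in reader)
arr = np.fromiter(rows, dtype=dtype)
```

If the number of rows is known in advance, passing count= to fromiter lets it allocate the output once instead of growing it.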
>  >>>
>  >>> Best Regards,
>  >>>
>  >>> //Torgil
>  >>>
>  >>>
>  >>> On 7/8/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>  >>>> Thanks for the reference John! csv2rec is about 30% faster than my code on
>  >>>> the same data.
>  >>>>
>  >>>> If I read the code in csv2rec correctly it converts the data as it is being
>  >>>> read using the csv module. My setup reads the whole dataset into an
>  >>>> array of strings and then converts the columns as appropriate.
>  >>>>
>  >>>> Best,
>  >>>>
>  >>>> Vincent
>  >>>>
>  >>>>
>  >>>> On 7/6/07 8:53 PM, "John Hunter" <jdh2358@gmail.com> wrote:
>  >>>>
>  >>>>> On 7/6/07, Vincent Nijs <v-nijs@kellogg.northwestern.edu> wrote:
>  >>>>>> I wrote the attached (small) program to read in a text/csv file with
>  >>>>>> different data types and convert it into a recarray without having to
>  >>>>>> pre-specify the dtypes or variable names. I am just too lazy to type in
>  >>>>>> stuff like that :) The supported types are int, float, dates, and
>  >>>>>> strings.
>  >>>>>>
>  >>>>>> It works pretty well but it is not (yet) as fast as I would like, so I was
>  >>>>>> wondering if any of the numpy experts on this list might have some
>  >>>>>> suggestions on how to speed it up. I need to read 500MB-1GB files so speed
>  >>>>>> is important for me.
>  >>>>>
>  >>>>> In matplotlib.mlab svn, there is a function csv2rec that does the
>  >>>>> same.  You may want to compare implementations in case we can
>  >>>>> fruitfully cross-pollinate them.  In the examples directory, there is an
>  >>>>> example script examples/loadrec.py
>  >>>>> _______________________________________________
>  >>>>> Numpy-discussion mailing list
>  >>>>> Numpy-discussion@scipy.org
>  >>>>> http://projects.scipy.org/mailman/listinfo/numpy-discussion
>  >>>>>
>  >>
>  >> --
>  >> Vincent R. Nijs
>  >> Assistant Professor of Marketing
>  >> Kellogg School of Management, Northwestern University
>  >> 2001 Sheridan Road, Evanston, IL 60208-2001
>  >> Phone: +1-847-491-4574 Fax: +1-847-491-2498
>  >> E-mail: v-nijs@kellogg.northwestern.edu
>  >> Skype: vincentnijs

