# [Numpy-discussion] String manipulation summary

Christopher Barker Chris.Barker@noaa....
Mon Jul 27 18:29:16 CDT 2009

Hi all,

When I first saw this problem (reading in a fixed-width text file as
numbers), it struck me that you really should be able to do it, and do it
well, with numpy by slicing character arrays.

I got carried away and worked out a number of ways to do it. The last was
a method inspired by a recent thread, "String to integer array of ASCII
values", which did indeed turn out to be the fastest. Here's what I have:

import numpy as np

# my naive first attempt:
def line2array0(line, field_len):
    nums = []
    i = 0
    while i < len(line):
        nums.append(float(line[i:i+field_len]))
        i += field_len
    return np.array(nums)

# list comprehension
def line2array1(line, field_len):
    return np.array(map(float, [line[i*field_len:(i+1)*field_len]
                                for i in range(len(line)/field_len)]))

# convert to a tuple, then to an 'S1' array -- no real reason to do
# this, as I figured out the next way.
def line2array2(line, field_len):
    return np.array(tuple(line),
                    dtype='S1').view(dtype='S%i'%field_len).astype(np.float)

# convert directly to a string array, then break into fields.
def line2array3(line, field_len):
    return np.array((line,)).view(dtype='S%i'%field_len).astype(np.float)

# use dtype='c' instead of 'S1' -- better.
def line2array4(line, field_len):
    return np.array(line,
                    dtype='c').view(dtype='S%i'%field_len).astype(np.float)

# and the winner is: use fromstring to go straight to a 'c' array:
def line2array5(line, field_len):
    return np.fromstring(line,
                         dtype='c').view(dtype='S%i'%field_len).astype(np.float)
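
For anyone trying this on Python 3 with a current NumPy: the functions
above are Python 2 era code. The np.float alias was removed in NumPy
1.24, and np.fromstring on raw bytes is deprecated in favour of
np.frombuffer. Below is a minimal sketch of the same "view the bytes as
fixed-width fields" idea under those assumptions. The name
line2array_py3 and the sample line are hypothetical, not from the
original test script.

```python
import numpy as np

def line2array_py3(line, field_len):
    """Split a fixed-width ASCII line into floats (Python 3 sketch)."""
    raw = line.encode("ascii")
    # Drop a trailing partial field so the buffer length is an exact
    # multiple of field_len, which np.frombuffer requires.
    usable = len(raw) - len(raw) % field_len
    fields = np.frombuffer(raw[:usable], dtype="S%d" % field_len)
    # astype(float) parses each fixed-width byte string as a number.
    return fields.astype(float)

# Hypothetical sample: three numbers, each in an 8-character field.
line = "".join("%8.2f" % x for x in (1.5, 2.25, -3.0))
print(line2array_py3(line, 8))
```

Reading the bytes directly as an 'S<n>' dtype skips the intermediate
'c' array and the .view() call, but the underlying idea is unchanged.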

Here are some timings:

Timing with a 10 number string:
List comp: 36.8073430061
convert to tuple: 57.9741871357
auto convert: 43.4103589058
char type: 46.0047719479
fromstring: 23.998103857
without float conversion: 11.4827179909

So the list comprehension is pretty fast, but using fromstring and then
slicing is noticeably better (about 24.0 vs. 36.8). The last entry is
the fromstring method without the conversion from strings to floats,
showing that the conversion is a big chunk of the time no matter how
you slice it.
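
The split-versus-convert comparison can be reproduced with something
like the sketch below. This is not the attached test.py; the helper
names, the field width of 8, and the repeat count are assumptions
chosen for illustration, and it uses np.frombuffer so that it runs on
Python 3.

```python
import timeit
import numpy as np

FIELD_LEN = 8  # assumed field width
# Ten numbers, each formatted into an 8-character field.
line = "".join("%8.2f" % x for x in range(10))

def split_only(line, field_len):
    # Break the line into fixed-width byte strings, no number parsing.
    return np.frombuffer(line.encode("ascii"), dtype="S%d" % field_len)

def split_and_convert(line, field_len):
    # Same split, followed by the string-to-float conversion.
    return split_only(line, field_len).astype(float)

t_split = timeit.timeit(lambda: split_only(line, FIELD_LEN), number=1000)
t_full = timeit.timeit(lambda: split_and_convert(line, FIELD_LEN), number=1000)
print("without float conversion:", t_split)
print("with float conversion:   ", t_full)
```

Absolute numbers will differ by machine and NumPy version, so only the
ratio between the two is meaningful.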

Timing with a 100 number string:
List comp: 163.281736135
convert to tuple: 333.081432104
auto convert: 138.934411049
char type: 279.897207975
fromstring: 121.395509005
without float conversion: 12.8342208862

Interesting -- I thought a longer string would give a greater advantage
to the fromstring approach, but I was wrong. Now the time to parse
strings into floats is really washing everything else out (12.8 without
the float conversion vs. 121.4 with it), so it doesn't matter much how
you do it, though I'd go with either the list comprehension (which is
what I think is used in np.genfromtxt) or the fromstring method, which
I kind of like 'cause it's numpy.
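
Since np.genfromtxt came up: it can read fixed-width data directly when
delimiter is given as an integer field width. A small sketch, assuming
8-character fields and made-up sample data:

```python
import io
import numpy as np

# Hypothetical two-line fixed-width file, 8 characters per field.
rows = [(1.5, 2.25, -3.0), (4.0, 5.5, 6.75)]
data = "\n".join("".join("%8.2f" % x for x in row) for row in rows) + "\n"

# An integer delimiter is interpreted as the width of each field.
arr = np.genfromtxt(io.StringIO(data), delimiter=8)
print(arr)
```

This is the convenient route for whole files; the hand-rolled functions
above are more useful when lines arrive one at a time.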

Test and timing code attached.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker@noaa.gov
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: application/x-python
Size: 3584 bytes
Desc: not available
Url : http://mail.scipy.org/pipermail/numpy-discussion/attachments/20090727/ab86dd69/attachment-0002.bin