[Numpy-discussion] Numpy 2D array from a list error

Bruce Southey bsouthey@gmail....
Wed Sep 23 09:12:08 CDT 2009


On 09/23/2009 08:42 AM, davew0000 wrote:
> Hi,
>
> I've got a fairly large (but not huge, 58mb) tab-separated text file, with
> approximately 200 columns and 56k rows of numbers and strings.
>
> Here's a snippet of my code to create a numpy matrix from the data file...
>
> ####
>
> data = map(lambda x : x.strip().split('\t'), sys.stdin.readlines())
> data = array(data)
>
> ###
>
> It causes the following error:
>
>> ValueError: setting an array element with a sequence
>>
> If I take the 1st 40,000 lines of the file, it works fine.
> If I take the last 40,000 lines of the file, it also works fine, so it isn't
> a problem with the file.
>
> I've found a few other posts complaining of the same problem, but none of
> their fixes work.
>
> It seems like a memory problem to me. This was reinforced when I tried to
> break the dataset into 3 chunks and stack the resulting arrays - I got an
> error message saying "memory error".
> I don't really understand why reading in this 57mb txt file is taking up
> ~2gb of RAM.
>
> Any advice? Thanks in advance
>
> Dave
>    
If the text file has 'numbers and strings', how is numpy meant to know
what dtype to use?
Please try genfromtxt, especially if columns contain both numbers and
strings.
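
As a minimal sketch of the genfromtxt suggestion (written for Python 3
and a reasonably recent numpy): the three-column sample below is made up
and only stands in for the real file. With dtype=None, genfromtxt infers
a type for each column separately and returns a structured array with
default field names f0, f1, and so on.

```python
import io

import numpy as np

# Hypothetical sample standing in for the real tab-separated file:
# an integer column, a float column and a string column.
sample = io.StringIO("1\t2.5\tabc\n2\t3.5\tdef\n")

# dtype=None makes genfromtxt pick a dtype per column instead of
# forcing every field into a single type.
table = np.genfromtxt(sample, delimiter="\t", dtype=None, encoding="utf-8")

print(table.dtype)   # one field per column, named f0, f1, f2
print(table["f1"])   # a single column, selected by field name
```

For the real data, the StringIO object would be replaced by the file
name or an open file handle.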

What happens if you read a file instead of using stdin?

It is possible that one or more rows have multiple sequential delimiters.
Please check the row lengths of your 'data' variable after doing:

data = map(lambda x : x.strip().split('\t'), sys.stdin.readlines())
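
One way to do that check, sketched for Python 3 with a made-up list of
lines in place of sys.stdin.readlines(); the helper name
row_length_counts is only for illustration. If more than one length
shows up, the rows are ragged and array() cannot build a regular 2D
array from them.

```python
from collections import Counter


def row_length_counts(lines, delimiter="\t"):
    """Return a Counter mapping 'number of fields' to 'number of rows'."""
    # Same parsing as the original snippet: strip, then split on tabs.
    return Counter(len(line.strip().split(delimiter)) for line in lines)


# Hypothetical input; the third row contains two tabs in a row.
lines = ["a\tb\tc\n", "1\t2\t3\n", "4\t\t5\t6\n"]
counts = row_length_counts(lines)
print(counts)
```

Note that strip() with no arguments also removes leading and trailing
tabs, so a row whose first or last field is empty will come out shorter
than the others.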

Really, without the input or details of your system, it is hard to say
more.
If you know your data well, I would suggest preallocating the array and
filling it one line at a time, to avoid creating multiple large
intermediate objects.
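
A rough sketch of the preallocation idea, assuming the number of rows
and columns is known in advance and that every field being loaded is
numeric (string columns would need a different dtype or separate
handling). The function name load_numeric and the sample input are
made up for illustration.

```python
import io

import numpy as np


def load_numeric(handle, n_rows, n_cols, delimiter="\t"):
    """Fill a preallocated float array one line at a time."""
    out = np.empty((n_rows, n_cols), dtype=float)
    n_read = 0
    for i, line in enumerate(handle):
        fields = line.rstrip("\n").split(delimiter)
        if len(fields) != n_cols:
            raise ValueError(
                "row %d has %d fields, expected %d" % (i, len(fields), n_cols)
            )
        # Only one row of Python strings is alive at any time.
        out[i] = [float(field) for field in fields]
        n_read = i + 1
    if n_read != n_rows:
        raise ValueError("read %d rows, expected %d" % (n_read, n_rows))
    return out


# Hypothetical 2 x 2 input standing in for the real file or sys.stdin.
arr = load_numeric(io.StringIO("1\t2\n3\t4\n"), n_rows=2, n_cols=2)
print(arr)
```

Iterating over the file handle directly, rather than calling
readlines(), also avoids holding every line in memory at once.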

Bruce




