[Numpy-discussion] array slicing questions
Vlastimil Brom
vlastimil.brom@gmail....
Tue Jul 31 09:01:43 CDT 2012
2012/7/31 eat <e.antero.tammi@gmail.com>:
> Hi,
>
> On Tue, Jul 31, 2012 at 10:23 AM, Vlastimil Brom <vlastimil.brom@gmail.com>
> wrote:
>>
>> 2012/7/30 eat <e.antero.tammi@gmail.com>:
>> > Hi,
>> >
>> > A partial answer to your questions:
>> >
>> > On Mon, Jul 30, 2012 at 10:33 PM, Vlastimil Brom
>> > <vlastimil.brom@gmail.com>
>> > wrote:
>> >>
>> >> Hi all,
>> >> I'd like to ask for some hints or advice regarding the usage of
>> >> numpy.array and especially slicing.
>> >>
>> >> I only recently tried numpy and was impressed by the speedup in some
>> >> parts of the code, hence I suspect, that I might miss some other
>> >> oportunities in this area.
>> >>
>> >> I currently use the following code for a simple visualisation of the
>> >> search matches within the text, the arrays are generally much larger
>> >> than the sample - the texts size is generally hundreds of kilobytes up
>> >> to a few MB - with an index position for each character.
>> >> First there is a list of spans(obtained form the regex match objects),
>> >> the respective character indices in between these slices should be set
>> >> to 1:
>> >>
>> >> >>> import numpy
>> >> >>> characters_matches = numpy.zeros(10)
>> >> >>> matches_spans = numpy.array([[2,4], [5,9]])
>> >> >>> for start, stop in matches_spans:
>> >> ... characters_matches[start:stop] = 1
>> >> ...
>> >> >>> characters_matches
>> >> array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
>> >>
>> >> Is there maybe a way tu achieve this in a numpy-only way - without the
>> >> python loop?
>> >> (I got the impression, the powerful slicing capabilities could make it
>> >> possible, bud haven't found this kind of solution.)
>> >>
>> >>
>> >> In the next piece of code all the character positions are evaluated
>> >> with their "neighbourhood" and a kind of running proportions of the
>> >> matched text parts are computed (the checks_distance could be
>> >> generally up to the order of the half the text length, usually less :
>> >>
>> >> >>>
>> >> >>> check_distance = 1
>> >> >>> floating_checks_proportions = []
>> >> >>> for i in numpy.arange(len(characters_matches)):
>> >> ... lo = i - check_distance
>> >> ... if lo < 0:
>> >> ... lo = None
>> >> ... hi = i + check_distance + 1
>> >> ... checked_sublist = characters_matches[lo:hi]
>> >> ... proportion = (checked_sublist.sum() / (check_distance * 2 +
>> >> 1.0))
>> >> ... floating_checks_proportions.append(proportion)
>> >> ...
>> >> >>> floating_checks_proportions
>> >> [0.0, 0.33333333333333331, 0.66666666666666663, 0.66666666666666663,
>> >> 0.66666666666666663, 0.66666666666666663, 1.0, 1.0,
>> >> 0.66666666666666663, 0.33333333333333331]
>> >> >>>
>> >
>> > Define a function for proportions:
>> >
>> > from numpy import r_
>> >
>> > from numpy.lib.stride_tricks import as_strided as ast
>> >
>> > def proportions(matches, distance= 1):
>> >
>> > cd, cd2p1, s= distance, 2* distance+ 1, matches.strides[0]
>> >
>> > # pad
>> >
>> > m= r_[[0.]* cd, matches, [0.]* cd]
>> >
>> > # create a suitable view
>> >
>> > m= ast(m, shape= (m.shape[0], cd2p1), strides= (s, s))
>> >
>> > # average
>> >
>> > return m[:-2* cd].sum(1)/ cd2p1
>> > and use it like:
>> > In []: matches
>> > Out[]: array([ 0., 0., 1., 1., 0., 1., 1., 1., 1., 0.])
>> >
>> > In []: proportions(matches).round(2)
>> > Out[]: array([ 0. , 0.33, 0.67, 0.67, 0.67, 0.67, 1. , 1. ,
>> > 0.67,
>> > 0.33])
>> > In []: proportions(matches, 5).round(2)
>> > Out[]: array([ 0.27, 0.36, 0.45, 0.55, 0.55, 0.55, 0.55, 0.55,
>> > 0.45,
>> > 0.36])
>> >>
>> >>
>> >> I'd like to ask about the possible better approaches, as it doesn't
>> >> look very elegant to me, and I obviously don't know the implications
>> >> or possible drawbacks of numpy arrays in some scenarios.
>> >>
>> >> the pattern
>> >> for i in range(len(...)): is usually considered inadequate in python,
>> >> but what should be used in this case as the indices are primarily
>> >> needed?
>> >> is something to be gained or lost using (x)range or np.arange as the
>> >> python loop is (probably?) inevitable anyway?
>> >
>> > Here np.arange(.) will create a new array and potentially wasting memory
>> > if
>> > it's not otherwise used. IMO nothing wrong looping with xrange(.) (if
>> > you
>> > really need to loop ;).
>> >>
>> >> Is there some mor elegant way to check for the "underflowing" lower
>> >> bound "lo" to replace with None?
>> >>
>> >> Is it significant, which container is used to collect the results of
>> >> the computation in the python loop - i.e. python list or a numpy
>> >> array?
>> >> (Could possibly matplotlib cooperate better with either container?)
>> >>
>> >> And of course, are there maybe other things, which should be made
>> >> better/differently?
>> >>
>> >> (using Numpy 1.6.2, python 2.7.3, win XP)
>> >
>> >
>> > My 2 cents,
>> > -eat
>> >>
>> >> Thanks in advance for any hints or suggestions,
>> >> regards,
>> >> Vlastimil Brom
>> >> _______________________________________________
>> >> NumPy-Discussion mailing list
>> >> NumPy-Discussion@scipy.org
>> >> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>> >
>> Hi,
>> thank you very much for your suggestions!
>>
>> do I understand it correctly, that I have to special-case the function
>> for distance = 0 (which should return the matches themselves without
>> recalculation)?
>
> Yes.
>>
>>
>> However, more importantly, I am getting a ValueError for some larger,
>> (but not completely unreasonable) "distance"
>>
>> >>> proportions(matches, distance= 8190)
>> Traceback (most recent call last):
>> File "<input>", line 1, in <module>
>> File "<input>", line 11, in proportions
>> File "C:\Python27\lib\site-packages\numpy\lib\stride_tricks.py",
>> line 28, in as_strided
>> return np.asarray(DummyArray(interface, base=x))
>> File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line
>> 235, in asarray
>> return array(a, dtype, copy=False, order=order)
>> ValueError: array is too big.
>> >>>
>>
>> the distance= 8189 was the largest which worked in this snippet,
>> however, it might be data-dependent, as I got this error as well e.g.
>> for distance=4529 for a 20k text.
>>
>> Is this implementation-limited, or could it be solved in some
>> alternative way which wouldn't have such limits (up to the order of,
>> say, millions)?
>
> Apparently ast(.) does not return a view of the original matches rather a
> copy of size (n* (2* distance+ 1)), thus you may run out of memory.
>
> Surely it can be solved up to millions of matches, but perhaps much slower
> speed.
>
>
> Regards,
> -eat
>>
>>
>> Thanks again
>> regards
>> vbr
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
Thank you for the confirmation,
I'll wait and see, whether the current speed isn't actually already
acceptable for the most cases...
I could already gain a speedup by using the array.sum() and other
features, maybe I will find yet other possibilities.
regards,
vbr
More information about the NumPy-Discussion
mailing list