[SciPy-User] Again on Calculating (Conditional) Time Intervals

Bruce Southey bsouthey@gmail....
Tue Jun 1 16:51:35 CDT 2010


I really don't understand what you are trying to do but here goes...

On 06/01/2010 10:16 AM, Lorenzo Isella wrote:
> Dear All,
> I hope this is not too off-topic. I have digged up an old email I posted which went unanswered quite some time ago.
> I made some progress on a simpler problem than the one for which I initially asked for help and I am attaching at the end of the email
> my own scripts. I anyone can help me to progress a bit further, I will be very grateful.
> Consider and array of this kind:
>
> 1      12    45
> 2      7       12
> 2      15     37
> 3      25     89
> 3      8       13
> 3      13     44
> 4      77     89
> 4      77     89
> 5      12     22
> 8      12     22
> 9      15     22
> 11    22     37
> 23    3       12
> 24    18     37
> 25    1      12
>
>
>
> where the first column is time measured in some units. The other two
> columns are some ID's identifying  infected individuals establishing a
> contact at the corresponding time. As you can see, there may be
> time-gaps in my recorded times and there may be repeated times if
> several contacts take place simultaneously.
How do you know that 'several contacts take place simultaneously' or is 
this due to another variable or rounding ?
For example, should you just drop the duplicated line '4 77  89'?

I would be tempted to store this is in 2-d matrix, say contact, where 
row and columns are the IDs. The duplicated occurrences suggest using 
either an another axis or a tuple. You could store a tuple (or even a 
numpy 1-d array) using a dictionary for a sparse matrix (eg 
http://openbookproject.net//thinkCSpy/ch12.html).

> The ID's are always sorted
> out in such a way the ID number of the 2nd column is always smaller than
> the corresponding entry of the third column (I am obviously indexing
> everything from 1).
> Now, this is my problem: I want to look at a specific ID I will call A
> (let us say A is 12) and calculate all the time differences t_AC-t_AB
> for B!=C, i.e. all the time intervals between the most recent contact
> between A and B and the first subsequent contact between and A and C
> (which has to be different from B).
> An example to fix the ideas: A=12, B=22, C=1, then
> t_AB=8 (pick the most recent one before t_AC)
>    
Is this '5' or '8'?

> t_AC=25,
>    


> hence t_AC-t_AB=25-8=17. (but let me say it again: I want to be able to
> calculate all such intervals for any B and C on the fly).
>    
Except for what happens next, this could be
contact[(1, 12)]- contact[(12, 22)]

> It should be clear at this point that the calculated t_AC-t_AB !=
> t_AB-t_AC as some time-ordering is implicit in the definition (in
> t_AC-t_AB, AC contacts have to always be more recent than AB contacts).
> Even in the case of multiple disjointed AB and AC contacts, I always
> have to look for the closest time intervals in time. E.g. if I had
>
> 10    12       22
> 40    12       22
> 60    1         12
> 100  1         12
> 110  12       22
> 130  12       22
> 150  1         12
>
> then I would work out the time intervals 60-40=20 and 150-130=20.
>    
How do you define these two intervals without knowing the values in 
advance?
What does 'closest' really mean?

So, given the above constraint that AC < AB,  '60-40' can come from:
AC contacts is [10, 40, 110, 130] (say from contacts[(1, 22)]
AB contacts is [60, 100, 150] (say from contacts[(1, 12)]

min(AB contacts) minus max(AC contacts that are smaller than min(AB 
contacts))

min(AB contacts)= 60
max(AC contacts that are smaller than min(AB contacts))= max([10, 40])=40
(How you do that last one is easier with numpy array because of the easy 
to find the elements smaller than the AB contacts.)

Otherwise you have to do some 'smart' way to find the minimum difference 
of all pairs that is greater than zero.
60-10
60-40
60-110
60-130
100-10
100-40
...
> Now, thanks to the help I got from the list, I am able to calculate the
> distribution of contact and interval durations between IDs (something simpler than the conditional
> time interval). See the code at the end of the email which you can run on the first dataset I provide in this email to get
> the contact/interval distributions.
> Sorry for the long email, but any suggestion about how to calculate the conditional probability efficiently would help me a great deal.
> Many thanks
>
> Lorenzo
>
>    
Bruce

> #!/usr/bin/env python
> import scipy as s
> import pylab as p
> import numpy as n
> import sys
> import string
>
>
> def single_tag_contact_times(sliced_data, tag_id):
>
>
>      #I can follow a given tag by selecting his ID number and looking
>      #for it through the data
>
>      sel=s.where(sliced_data==tag_id)
>
>      #now I need to add a condition in case the id of the tag I have chosen is non-existing
>
>
>      if (len(sel[0])==0):
>          #print "the chosen tag does not exist"
>          return
>
>      tag_contact_times=sliced_data[sel[0],0] #I select the times at which
>      #the tag I am tracking undergoes a contact.
>
>
>
>      tag_no_rep=n.unique1d(tag_contact_times) #The idea is the following:
>      #in a given time interval delta_slice, a tag may undergo multiple contacts
>      #with different tags. This corresponds to different entries in the output
>      #of time_binned_interaction. That function does not allow for multiple contacts between
>      # the SAME two tags being reported in the same time slice, but it allows the same tag ID
>      #to appear twice in the same time window if it gets in touch with TWO different tags
>      #within the same delta_slice. It is fundamental to know that in a given time slice
>      #tag A has estabilished contact with tag B and tag C (if I discard any bit of this info,
>      #then I lose info about the state of the network at that time slice), but when it comes to
>      #simply having the time-dependent distribution of contact durations and intervals between
>      #any two contacts estabilished by packet A, I will simply say that tag A reported a contact
>      #in that given time-slice. More sophisticated statistics (e.g. the number of contacts
>      #estabilished by tag A in a given time slice), can be implemented if found useful/needed
>      #later on.
>
>
>
>      #p.save("single_tag_contact_times_no_rep.dat",tag_no_rep,fmt='%d')
>
>      return  tag_no_rep
>
>
> def contact_duration_and_interval_many_tags(sliced_interactions,\
>                                           delta_slice, counter):
>
>      #I added this line since now there is no guarantee that in the edge list
>      # (contact list) tag_A---tag_B, the id of tag_A is<= id of tag_B.
>
>      sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])
>
>      #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
>      #thus outputting the distribution of time intervals between any two contacts in the system.
>
>      tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
>      #all tag ID`s, which appear (repeated) on two rows of the matrix output by
>      # time_binned_interaction
>
>
>      #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')
>
>
>      # tag_ids=tag_ids.astype('int')
>
>
>
>      #print "tag_ids is, ", tag_ids
>
>      overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
>      #contacts for all the tags in the system.
>
>
>
>      overall_duration=s.zeros(0) #this array will contain the time duration of the
>      #contacts for all the tags in the system.
>
>
>
>      for i in xrange(len(tag_ids)):
>          track_tag_id=tag_ids[i]  #i.e. iterate on all tags
>
>          contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
>          #an array with all the interactions of a given tag
>
>          #print "contact_times is, ", contact_times
>
>          results=contact_duration_and_interval_single_tag(contact_times, delta_slice)
>
>          tag_duration=results[0]
>
>
>          tag_intervals=results[1] #get
>          #an array with the time intervals between two contacts for a given tag
>
>
>          #print "tag_intervals is, ", tag_intervals
>
>          overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
>          #the results on all tags
>
>
>          #print "overall_gaps is, ", overall_gaps
>
>          overall_duration=s.hstack((overall_duration,tag_duration))
>
>      #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
>      #overall_duration=overall_duration[s.where(overall_duration !=0)]
>      filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
>      filename=filename+"_.dat"
>
>      n.savetxt(filename, overall_gaps ,  fmt='%d')
>
>      filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
>      filename=filename+"_.dat"
>
>
>      n.savetxt(filename, overall_duration ,  fmt='%d')
>
>      return overall_duration, overall_gaps
>
>
>
> def contact_duration_and_interval_many_tags(sliced_interactions,\
>                                           delta_slice, counter):
>
>      #I added this line since now there is no guarantee that in the edge list
>      # (contact list) tag_A---tag_B, the id of tag_A is<= id of tag_B.
>
>      sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])
>
>      #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
>      #thus outputting the distribution of time intervals between any two contacts in the system.
>
>      tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
>      #all tag ID`s, which appear (repeated) on two rows of the matrix output by
>      # time_binned_interaction
>
>
>      #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')
>
>
>      # tag_ids=tag_ids.astype('int')
>
>
>
>      #print "tag_ids is, ", tag_ids
>
>      overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
>      #contacts for all the tags in the system.
>
>
>
>      overall_duration=s.zeros(0) #this array will contain the time duration of the
>      #contacts for all the tags in the system.
>
>
>
>      for i in xrange(len(tag_ids)):
>          track_tag_id=tag_ids[i]  #i.e. iterate on all tags
>
>          contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
>          #an array with all the interactions of a given tag
>
>          #print "contact_times is, ", contact_times
>
>          results=contact_duration_and_interval_single_tag(contact_times, delta_slice)
>
>          tag_duration=results[0]
>
>
>          tag_intervals=results[1] #get
>          #an array with the time intervals between two contacts for a given tag
>
>
>          #print "tag_intervals is, ", tag_intervals
>
>          overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
>          #the results on all tags
>
>
>          #print "overall_gaps is, ", overall_gaps
>
>          overall_duration=s.hstack((overall_duration,tag_duration))
>
>      #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
>      #overall_duration=overall_duration[s.where(overall_duration !=0)]
>      filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
>      filename=filename+"_.dat"
>
>      n.savetxt(filename, overall_gaps ,  fmt='%d')
>
>      filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
>      filename=filename+"_.dat"
>
>
>      n.savetxt(filename, overall_duration ,  fmt='%d')
>
>      return overall_duration, overall_gaps
>
>
> def contact_duration_and_interval_single_tag(single_tag_no_rep, delta_slice):
>
>      #the following if condition is useful only when I am really tracking a particular
>      #tag whose ID is given a priory but which may not exist at all (in the sense that
>      #it would not estabilish any contact) in the time window during which I am studying
>      #the system.
>
>
>      if (single_tag_no_rep==None):
>          print "The chosen tag does not exist hence no analysis can be performed on it"
>          return
>
>
>
>      # delta_slice=int(delta_slice) #I do not need floating point arithmetic
>
>      single_tag_no_rep=(single_tag_no_rep-single_tag_no_rep[0])/delta_slice
>      gaps=s.diff(single_tag_no_rep) #a bit more efficient than the line above
>
>      #print "gaps is, ", gaps
>
>      #gaps is now an array of integers. It either has a list of consecutive 1`s
>      # (which means a contact duration of delta_slice times the number of consecutive ones)
>      # of an entry higher than one which expresses (in units of delta_slice) the time during
>      #which the tag underwent no contact
>
>
>      #p.save("gaps.dat",gaps, fmt='%d')
>
>      # find_gap=s.where(gaps != 1)[0]
>
>      find_gap=s.where(gaps>  1)[0] #a better definition: a tag may estabilish
>      #several contacts within the same timeslice. So I may have some zeros in
>      #gaps due to different simultaneous contacts. a tag is truly disconnected
>      #from all the others when I see an increment larger than  one in the
>      #rescaled time.
>
>
>      gap_distr=(gaps[find_gap]-1)*delta_slice  #so, this is really the list of the
>      #time interval between two contacts for my tag. After the discussion with Ciro,
>      #I modified slightly the definition (now there is a -1) in the definition.
>      #It probably does not matter much for the calculated distribution.
>
>      #print "gap_distr is, ", gap_distr
>      #NB: the procedure above does NOT break down is gap_distr is empty
>
>
>      #Now I calculate the duration of the contacts of my tag. I changed this bit since
>      #I had new discussions with Ciro
>
>      #single_tag_no_rep=s.hstack((0,single_tag_no_rep))
>
>      #print "single_tag_no_rep is, ", single_tag_no_rep, "and its length is, ", len(single_tag_no_rep)
>
>      # e2=s.diff(single_tag_no_rep)
>
>      # #print "e2 is, ", e2
>
>      # sel=s.where(e2!=1)[0]
>      # #print "sel is, ", sel
>
>      #sel=s.where(gaps!=1)[0]
>
>      # res=0 #this will contain the results and will be overwritten
>
>      #What is here needs to be tested very carefully! There may be some bugs
>
>
>      sol=s.hstack((0,find_gap,len(gaps)))
>      #print "sol is, ", sol
>
>
>
>      res=s.diff(sol)
>      #print "res initially is, ", res
>
>
>      res[0]=res[0]+1 #to account for troubles I normally have at the beginning of the array
>
>      #print "res is, ", res
>
>
>
>
>      res=res*delta_slice
>
>
>      #print "the sum of all the durations is, ", res.sum()
>
>      return [res,gap_distr]
>
>
> f = open(sys.argv[1])
> sliced_interactions = [map(int, string.split(line)) for line in f.readlines()]
> f.close()
>
> print ("sliced_interactions is, ", sliced_interactions)
>
> sliced_interactions = s.array(sliced_interactions, dtype="int64")
>
> print ("sliced_interactions is now, ", sliced_interactions)
>
> counter=0
>
> delta_slice=1
>
> contact_duration_and_interval_many_tags(sliced_interactions,\
>                                              delta_slice,counter)
>
>
>
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>    

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.scipy.org/pipermail/scipy-user/attachments/20100601/a33b41d8/attachment-0001.html 


More information about the SciPy-User mailing list