[SciPy-User] Again on Calculating (Conditional) Time Intervals

Lorenzo Isella lorenzo.isella@gmail....
Tue Jun 1 10:16:35 CDT 2010


Dear All,
I hope this is not too off-topic. I have digged up an old email I posted which went unanswered quite some time ago.
I made some progress on a simpler problem than the one for which I initially asked for help and I am attaching at the end of the email 
my own scripts. I anyone can help me to progress a bit further, I will be very grateful.
Consider and array of this kind:

1      12    45
2      7       12
2      15     37
3      25     89
3      8       13
3      13     44
4      77     89
4      77     89
5      12     22
8      12     22
9      15     22
11    22     37
23    3       12
24    18     37
25    1      12



where the first column is time measured in some units. The other two 
columns are some ID's identifying  infected individuals establishing a 
contact at the corresponding time. As you can see, there may be 
time-gaps in my recorded times and there may be repeated times if 
several contacts take place simultaneously. The ID's are always sorted 
out in such a way the ID number of the 2nd column is always smaller than 
the corresponding entry of the third column (I am obviously indexing 
everything from 1).
Now, this is my problem: I want to look at a specific ID I will call A 
(let us say A is 12) and calculate all the time differences t_AC-t_AB 
for B!=C, i.e. all the time intervals between the most recent contact 
between A and B and the first subsequent contact between and A and C 
(which has to be different from B).
An example to fix the ideas: A=12, B=22, C=1, then
t_AB=8 (pick the most recent one before t_AC)
t_AC=25,
hence t_AC-t_AB=25-8=17. (but let me say it again: I want to be able to 
calculate all such intervals for any B and C on the fly).
It should be clear at this point that the calculated t_AC-t_AB != 
t_AB-t_AC as some time-ordering is implicit in the definition (in 
t_AC-t_AB, AC contacts have to always be more recent than AB contacts).
Even in the case of multiple disjointed AB and AC contacts, I always 
have to look for the closest time intervals in time. E.g. if I had

10    12       22
40    12       22
60    1         12
100  1         12
110  12       22
130  12       22
150  1         12 

then I would work out the time intervals 60-40=20 and 150-130=20.
Now, thanks to the help I got from the list, I am able to calculate the 
distribution of contact and interval durations between IDs (something simpler than the conditional 
time interval). See the code at the end of the email which you can run on the first dataset I provide in this email to get 
the contact/interval distributions. 
Sorry for the long email, but any suggestion about how to calculate the conditional probability efficiently would help me a great deal.
Many thanks

Lorenzo


#!/usr/bin/env python
import scipy as s
import pylab as p
import numpy as n
import sys
import string


def single_tag_contact_times(sliced_data, tag_id):

 
    #I can follow a given tag by selecting his ID number and looking
    #for it through the data 
    
    sel=s.where(sliced_data==tag_id)

    #now I need to add a condition in case the id of the tag I have chosen is non-existing
  
    
    if (len(sel[0])==0):
        #print "the chosen tag does not exist"
        return
    
    tag_contact_times=sliced_data[sel[0],0] #I select the times at which
    #the tag I am tracking undergoes a contact.

   

    tag_no_rep=n.unique1d(tag_contact_times) #The idea is the following:
    #in a given time interval delta_slice, a tag may undergo multiple contacts
    #with different tags. This corresponds to different entries in the output
    #of time_binned_interaction. That function does not allow for multiple contacts between
    # the SAME two tags being reported in the same time slice, but it allows the same tag ID
    #to appear twice in the same time window if it gets in touch with TWO different tags
    #within the same delta_slice. It is fundamental to know that in a given time slice
    #tag A has estabilished contact with tag B and tag C (if I discard any bit of this info,
    #then I lose info about the state of the network at that time slice), but when it comes to
    #simply having the time-dependent distribution of contact durations and intervals between
    #any two contacts estabilished by packet A, I will simply say that tag A reported a contact
    #in that given time-slice. More sophisticated statistics (e.g. the number of contacts
    #estabilished by tag A in a given time slice), can be implemented if found useful/needed
    #later on.



    #p.save("single_tag_contact_times_no_rep.dat",tag_no_rep,fmt='%d')

    return  tag_no_rep


def contact_duration_and_interval_many_tags(sliced_interactions,\
                                         delta_slice, counter):

    #I added this line since now there is no guarantee that in the edge list
    # (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.

    sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])

    #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
    #thus outputting the distribution of time intervals between any two contacts in the system.

    tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
    #all tag ID`s, which appear (repeated) on two rows of the matrix output by 
    # time_binned_interaction


    #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')

    
    # tag_ids=tag_ids.astype('int')

    

    #print "tag_ids is, ", tag_ids

    overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
    #contacts for all the tags in the system.



    overall_duration=s.zeros(0) #this array will contain the time duration of the
    #contacts for all the tags in the system.



    for i in xrange(len(tag_ids)): 
        track_tag_id=tag_ids[i]  #i.e. iterate on all tags
        
        contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
        #an array with all the interactions of a given tag

        #print "contact_times is, ", contact_times

        results=contact_duration_and_interval_single_tag(contact_times, delta_slice)

        tag_duration=results[0]

        
        tag_intervals=results[1] #get
        #an array with the time intervals between two contacts for a given tag

        
        #print "tag_intervals is, ", tag_intervals
        
        overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
        #the results on all tags
        
        
        #print "overall_gaps is, ", overall_gaps

        overall_duration=s.hstack((overall_duration,tag_duration))

    #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
    #overall_duration=overall_duration[s.where(overall_duration !=0)]
    filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
    filename=filename+"_.dat"

    n.savetxt(filename, overall_gaps ,  fmt='%d')

    filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
    filename=filename+"_.dat"

    
    n.savetxt(filename, overall_duration ,  fmt='%d')

    return overall_duration, overall_gaps



def contact_duration_and_interval_many_tags(sliced_interactions,\
                                         delta_slice, counter):

    #I added this line since now there is no guarantee that in the edge list
    # (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.

    sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])

    #This function iterates interval_between_contacts_single_tag on a all the tag ID`s
    #thus outputting the distribution of time intervals between any two contacts in the system.

    tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
    #all tag ID`s, which appear (repeated) on two rows of the matrix output by 
    # time_binned_interaction


    #n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')

    
    # tag_ids=tag_ids.astype('int')

    

    #print "tag_ids is, ", tag_ids

    overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
    #contacts for all the tags in the system.



    overall_duration=s.zeros(0) #this array will contain the time duration of the
    #contacts for all the tags in the system.



    for i in xrange(len(tag_ids)): 
        track_tag_id=tag_ids[i]  #i.e. iterate on all tags
        
        contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
        #an array with all the interactions of a given tag

        #print "contact_times is, ", contact_times

        results=contact_duration_and_interval_single_tag(contact_times, delta_slice)

        tag_duration=results[0]

        
        tag_intervals=results[1] #get
        #an array with the time intervals between two contacts for a given tag

        
        #print "tag_intervals is, ", tag_intervals
        
        overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
        #the results on all tags
        
        
        #print "overall_gaps is, ", overall_gaps

        overall_duration=s.hstack((overall_duration,tag_duration))

    #overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
    #overall_duration=overall_duration[s.where(overall_duration !=0)]
    filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
    filename=filename+"_.dat"

    n.savetxt(filename, overall_gaps ,  fmt='%d')

    filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
    filename=filename+"_.dat"

    
    n.savetxt(filename, overall_duration ,  fmt='%d')

    return overall_duration, overall_gaps


def contact_duration_and_interval_single_tag(single_tag_no_rep, delta_slice):

    #the following if condition is useful only when I am really tracking a particular
    #tag whose ID is given a priory but which may not exist at all (in the sense that
    #it would not estabilish any contact) in the time window during which I am studying
    #the system.


    if (single_tag_no_rep==None):
        print "The chosen tag does not exist hence no analysis can be performed on it"
        return



    # delta_slice=int(delta_slice) #I do not need floating point arithmetic

    single_tag_no_rep=(single_tag_no_rep-single_tag_no_rep[0])/delta_slice
    gaps=s.diff(single_tag_no_rep) #a bit more efficient than the line above

    #print "gaps is, ", gaps

    #gaps is now an array of integers. It either has a list of consecutive 1`s
    # (which means a contact duration of delta_slice times the number of consecutive ones)
    # of an entry higher than one which expresses (in units of delta_slice) the time during
    #which the tag underwent no contact
    

    #p.save("gaps.dat",gaps, fmt='%d')

    # find_gap=s.where(gaps != 1)[0]

    find_gap=s.where(gaps > 1)[0] #a better definition: a tag may estabilish
    #several contacts within the same timeslice. So I may have some zeros in
    #gaps due to different simultaneous contacts. a tag is truly disconnected
    #from all the others when I see an increment larger than  one in the
    #rescaled time.


    gap_distr=(gaps[find_gap]-1)*delta_slice  #so, this is really the list of the
    #time interval between two contacts for my tag. After the discussion with Ciro,
    #I modified slightly the definition (now there is a -1) in the definition.
    #It probably does not matter much for the calculated distribution.

    #print "gap_distr is, ", gap_distr
    #NB: the procedure above does NOT break down is gap_distr is empty


    #Now I calculate the duration of the contacts of my tag. I changed this bit since
    #I had new discussions with Ciro

    #single_tag_no_rep=s.hstack((0,single_tag_no_rep))

    #print "single_tag_no_rep is, ", single_tag_no_rep, "and its length is, ", len(single_tag_no_rep)
    
    # e2=s.diff(single_tag_no_rep)

    # #print "e2 is, ", e2
    
    # sel=s.where(e2!=1)[0]
    # #print "sel is, ", sel

    #sel=s.where(gaps!=1)[0]

    # res=0 #this will contain the results and will be overwritten

    #What is here needs to be tested very carefully! There may be some bugs


    sol=s.hstack((0,find_gap,len(gaps)))
    #print "sol is, ", sol

        
    
    res=s.diff(sol)
    #print "res initially is, ", res


    res[0]=res[0]+1 #to account for troubles I normally have at the beginning of the array

    #print "res is, ", res
        
        


    res=res*delta_slice


    #print "the sum of all the durations is, ", res.sum()

    return [res,gap_distr] 


f = open(sys.argv[1])
sliced_interactions = [map(int, string.split(line)) for line in f.readlines()]
f.close()

print ("sliced_interactions is, ", sliced_interactions)

sliced_interactions = s.array(sliced_interactions, dtype="int64")

print ("sliced_interactions is now, ", sliced_interactions)

counter=0

delta_slice=1

contact_duration_and_interval_many_tags(sliced_interactions,\
                                            delta_slice,counter)






More information about the SciPy-User mailing list