[SciPy-User] Again on Calculating (Conditional) Time Intervals

Lorenzo Isella lorenzo.isella@gmail....
Tue Jun 1 10:16:35 CDT 2010

```Dear All,
I hope this is not too off-topic. I have digged up an old email I posted which went unanswered quite some time ago.
I made some progress on a simpler problem than the one for which I initially asked for help and I am attaching at the end of the email
my own scripts. I anyone can help me to progress a bit further, I will be very grateful.
Consider and array of this kind:

1      12    45
2      7       12
2      15     37
3      25     89
3      8       13
3      13     44
4      77     89
4      77     89
5      12     22
8      12     22
9      15     22
11    22     37
23    3       12
24    18     37
25    1      12

where the first column is time measured in some units. The other two
columns are some ID's identifying  infected individuals establishing a
contact at the corresponding time. As you can see, there may be
time-gaps in my recorded times and there may be repeated times if
several contacts take place simultaneously. The ID's are always sorted
out in such a way the ID number of the 2nd column is always smaller than
the corresponding entry of the third column (I am obviously indexing
everything from 1).
Now, this is my problem: I want to look at a specific ID I will call A
(let us say A is 12) and calculate all the time differences t_AC-t_AB
for B!=C, i.e. all the time intervals between the most recent contact
between A and B and the first subsequent contact between and A and C
(which has to be different from B).
An example to fix the ideas: A=12, B=22, C=1, then
t_AB=8 (pick the most recent one before t_AC)
t_AC=25,
hence t_AC-t_AB=25-8=17. (but let me say it again: I want to be able to
calculate all such intervals for any B and C on the fly).
It should be clear at this point that the calculated t_AC-t_AB !=
t_AB-t_AC as some time-ordering is implicit in the definition (in
t_AC-t_AB, AC contacts have to always be more recent than AB contacts).
Even in the case of multiple disjointed AB and AC contacts, I always
have to look for the closest time intervals in time. E.g. if I had

10    12       22
40    12       22
60    1         12
100  1         12
110  12       22
130  12       22
150  1         12

then I would work out the time intervals 60-40=20 and 150-130=20.
Now, thanks to the help I got from the list, I am able to calculate the
distribution of contact and interval durations between IDs (something simpler than the conditional
time interval). See the code at the end of the email which you can run on the first dataset I provide in this email to get
the contact/interval distributions.
Sorry for the long email, but any suggestion about how to calculate the conditional probability efficiently would help me a great deal.
Many thanks

Lorenzo

#!/usr/bin/env python
import scipy as s
import pylab as p
import numpy as n
import sys
import string

def single_tag_contact_times(sliced_data, tag_id):

#I can follow a given tag by selecting his ID number and looking
#for it through the data

sel=s.where(sliced_data==tag_id)

#now I need to add a condition in case the id of the tag I have chosen is non-existing

if (len(sel[0])==0):
#print "the chosen tag does not exist"
return

tag_contact_times=sliced_data[sel[0],0] #I select the times at which
#the tag I am tracking undergoes a contact.

tag_no_rep=n.unique1d(tag_contact_times) #The idea is the following:
#in a given time interval delta_slice, a tag may undergo multiple contacts
#with different tags. This corresponds to different entries in the output
#of time_binned_interaction. That function does not allow for multiple contacts between
# the SAME two tags being reported in the same time slice, but it allows the same tag ID
#to appear twice in the same time window if it gets in touch with TWO different tags
#within the same delta_slice. It is fundamental to know that in a given time slice
#tag A has estabilished contact with tag B and tag C (if I discard any bit of this info,
#then I lose info about the state of the network at that time slice), but when it comes to
#simply having the time-dependent distribution of contact durations and intervals between
#any two contacts estabilished by packet A, I will simply say that tag A reported a contact
#in that given time-slice. More sophisticated statistics (e.g. the number of contacts
#estabilished by tag A in a given time slice), can be implemented if found useful/needed
#later on.

#p.save("single_tag_contact_times_no_rep.dat",tag_no_rep,fmt='%d')

return  tag_no_rep

def contact_duration_and_interval_many_tags(sliced_interactions,\
delta_slice, counter):

#I added this line since now there is no guarantee that in the edge list
# (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.

sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])

#This function iterates interval_between_contacts_single_tag on a all the tag ID`s
#thus outputting the distribution of time intervals between any two contacts in the system.

tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
#all tag ID`s, which appear (repeated) on two rows of the matrix output by
# time_binned_interaction

#n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')

# tag_ids=tag_ids.astype('int')

#print "tag_ids is, ", tag_ids

overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
#contacts for all the tags in the system.

overall_duration=s.zeros(0) #this array will contain the time duration of the
#contacts for all the tags in the system.

for i in xrange(len(tag_ids)):
track_tag_id=tag_ids[i]  #i.e. iterate on all tags

contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
#an array with all the interactions of a given tag

#print "contact_times is, ", contact_times

results=contact_duration_and_interval_single_tag(contact_times, delta_slice)

tag_duration=results[0]

tag_intervals=results[1] #get
#an array with the time intervals between two contacts for a given tag

#print "tag_intervals is, ", tag_intervals

overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
#the results on all tags

#print "overall_gaps is, ", overall_gaps

overall_duration=s.hstack((overall_duration,tag_duration))

#overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
#overall_duration=overall_duration[s.where(overall_duration !=0)]
filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
filename=filename+"_.dat"

n.savetxt(filename, overall_gaps ,  fmt='%d')

filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
filename=filename+"_.dat"

n.savetxt(filename, overall_duration ,  fmt='%d')

return overall_duration, overall_gaps

def contact_duration_and_interval_many_tags(sliced_interactions,\
delta_slice, counter):

#I added this line since now there is no guarantee that in the edge list
# (contact list) tag_A---tag_B, the id of tag_A is <= id of tag_B.

sliced_interactions[:,1:3]=s.sort(sliced_interactions[:,1:3])

#This function iterates interval_between_contacts_single_tag on a all the tag ID`s
#thus outputting the distribution of time intervals between any two contacts in the system.

tag_ids= n.unique1d(s.ravel(sliced_interactions[:,1:3])) #to get a list of
#all tag ID`s, which appear (repeated) on two rows of the matrix output by
# time_binned_interaction

#n.savetxt("tag_IDs.dat", tag_ids ,  fmt='%d')

# tag_ids=tag_ids.astype('int')

#print "tag_ids is, ", tag_ids

overall_gaps=s.zeros(0) #this array will contain the time intervals between two consecutive
#contacts for all the tags in the system.

overall_duration=s.zeros(0) #this array will contain the time duration of the
#contacts for all the tags in the system.

for i in xrange(len(tag_ids)):
track_tag_id=tag_ids[i]  #i.e. iterate on all tags

contact_times=single_tag_contact_times(sliced_interactions, track_tag_id) #get
#an array with all the interactions of a given tag

#print "contact_times is, ", contact_times

results=contact_duration_and_interval_single_tag(contact_times, delta_slice)

tag_duration=results[0]

tag_intervals=results[1] #get
#an array with the time intervals between two contacts for a given tag

#print "tag_intervals is, ", tag_intervals

overall_gaps=s.hstack((overall_gaps,tag_intervals)) #collect
#the results on all tags

#print "overall_gaps is, ", overall_gaps

overall_duration=s.hstack((overall_duration,tag_duration))

#overall_gaps=overall_gaps[s.where(overall_gaps !=0)]
#overall_duration=overall_duration[s.where(overall_duration !=0)]
filename="many_tags_contact_interval_distr2_%01d"%(counter+1)
filename=filename+"_.dat"

n.savetxt(filename, overall_gaps ,  fmt='%d')

filename="many_tags_contact_duration_distr2_%01d"%(counter+1)
filename=filename+"_.dat"

n.savetxt(filename, overall_duration ,  fmt='%d')

return overall_duration, overall_gaps

def contact_duration_and_interval_single_tag(single_tag_no_rep, delta_slice):

#the following if condition is useful only when I am really tracking a particular
#tag whose ID is given a priory but which may not exist at all (in the sense that
#it would not estabilish any contact) in the time window during which I am studying
#the system.

if (single_tag_no_rep==None):
print "The chosen tag does not exist hence no analysis can be performed on it"
return

# delta_slice=int(delta_slice) #I do not need floating point arithmetic

single_tag_no_rep=(single_tag_no_rep-single_tag_no_rep[0])/delta_slice
gaps=s.diff(single_tag_no_rep) #a bit more efficient than the line above

#print "gaps is, ", gaps

#gaps is now an array of integers. It either has a list of consecutive 1`s
# (which means a contact duration of delta_slice times the number of consecutive ones)
# of an entry higher than one which expresses (in units of delta_slice) the time during
#which the tag underwent no contact

#p.save("gaps.dat",gaps, fmt='%d')

# find_gap=s.where(gaps != 1)[0]

find_gap=s.where(gaps > 1)[0] #a better definition: a tag may estabilish
#several contacts within the same timeslice. So I may have some zeros in
#gaps due to different simultaneous contacts. a tag is truly disconnected
#from all the others when I see an increment larger than  one in the
#rescaled time.

gap_distr=(gaps[find_gap]-1)*delta_slice  #so, this is really the list of the
#time interval between two contacts for my tag. After the discussion with Ciro,
#I modified slightly the definition (now there is a -1) in the definition.
#It probably does not matter much for the calculated distribution.

#print "gap_distr is, ", gap_distr
#NB: the procedure above does NOT break down is gap_distr is empty

#Now I calculate the duration of the contacts of my tag. I changed this bit since
#I had new discussions with Ciro

#single_tag_no_rep=s.hstack((0,single_tag_no_rep))

#print "single_tag_no_rep is, ", single_tag_no_rep, "and its length is, ", len(single_tag_no_rep)

# e2=s.diff(single_tag_no_rep)

# #print "e2 is, ", e2

# sel=s.where(e2!=1)[0]
# #print "sel is, ", sel

#sel=s.where(gaps!=1)[0]

# res=0 #this will contain the results and will be overwritten

#What is here needs to be tested very carefully! There may be some bugs

sol=s.hstack((0,find_gap,len(gaps)))
#print "sol is, ", sol

res=s.diff(sol)
#print "res initially is, ", res

res[0]=res[0]+1 #to account for troubles I normally have at the beginning of the array

#print "res is, ", res

res=res*delta_slice

#print "the sum of all the durations is, ", res.sum()

return [res,gap_distr]

f = open(sys.argv[1])
sliced_interactions = [map(int, string.split(line)) for line in f.readlines()]
f.close()

print ("sliced_interactions is, ", sliced_interactions)

sliced_interactions = s.array(sliced_interactions, dtype="int64")

print ("sliced_interactions is now, ", sliced_interactions)

counter=0

delta_slice=1

contact_duration_and_interval_many_tags(sliced_interactions,\
delta_slice,counter)

```