[SciPy-user] mid-size outerproduct too big for memory

Thomas Davidoff davidoff at haas.berkeley.edu
Sun Apr 24 22:51:17 CDT 2005


Thanks for the help.
The code (attached) is going to reek of poor programming skills but I 
think the dimensions on line 158 (where the program stops) are 
conformable.  I can't attach the datasets - too big.  If you are truly 
curious, the .uf3 files are at census.gov under sf3 ftp page.
As for the other processes, I don't have mucnh of anything else going 
on.  Running top on my remote mac leads to
Processes:  55 total, 2 running, 53 sleeping... 137 threads            
20:48:38
Load Avg:  0.00, 0.00, 0.00     CPU usage:  1.3% user, 4.8% sys, 93.8% idle
SharedLibs: num =  134, resident = 60.2M code, 4.59M data, 18.3M LinkEdit
MemRegions: num = 12317, resident =  124M + 10.5M private, 89.8M shared
PhysMem:   233M wired,  163M active,  207M inactive,  604M used, 1.91G free
VM: 5.08G + 95.1M   259046(0) pageins, 11945(0) pageouts

Robert Kern wrote:

> Thomas Davidoff wrote:
>
>> I tried to get the outerproduct(a,b) where a is a 23,297 by 15 matrix 
>> and b is a 15 element row vector.  This led to a memory error.  This 
>> is a pretty small dataset for empirical economics - is there a way 
>> around this? 
>
>
> 23297*15*15*8./1024/1024 = 40
>
> 40 MB. My G4-1GB memory handles this fine. I'm not sure why you are 
> seeing a memory error. Could you post your code?
>
>> Will I ever be able to do something like 1 million by 100 or is it 
>> back to Stata for me?
>
>
> 1000000*100*100*8/1024/1024/1024 == 74
>
> 74 GB. No, I don't think your machine will handle that outerproduct well.
>
>> I don't think the issue is other stuff in the program because nothing 
>> should be eating a lot of memory. 
>
>
> How about other stuff running on the system?
>
>> I noticed I get the same problem in command line python just trying a 
>> 100,000 by 100 matrix and a 1 by 100 row vector.
>> I am on a new G5 PowerMac with lots of new memory installed.
>
>

-- 
Thomas Davidoff
Assistant Professor
Haas School of Business
UC Berkeley
Berkeley, CA 94618
Phone:	(510) 643-1425
Fax:	(510) 643-7357
email:	davidoff at haas.berkeley.edu
web:	http://faculty.haas.berkeley.edu/davidoff/

-------------- next part --------------
# jueround2.py
# Python program to do sorting paper work
# Tom Davidoff
# 4/20/200o
# Start date: Wed Apr 20 16:17:34 PDT 2005

# How long?
import time
starttime = time.time()

#Packages
from scipy import *
from rpy import *
import re

# Make empty vectors of logrecno for zipcode, lat and lon
logvec = []
latvec = []
lonvec = []
zipvec = []
mcdvec = [] # MCD is minor civil division 
f = open('data/sorting/mageo.uf3','r') # Census 2000 geographic file

# Ensure only zips in Boston MSA - use census matching file 
zipfile = open('data/sorting/zcta_to_99msa_rel.txt', 'r')
msa = []
mcd = [] # for knowing what town
zip = []
for line in zipfile:
	line=line.split("\t")
	if re.search('1122',line[0]) and len(line[4])>0 :
		msa.append(float(line[0]))
		mcd.append(line[4])
		try:
			zip.append(float(line[5])) 
		except: 
			ValueError
for line in f:
	try:
		zipcode = float(line[160:165])
		ziprow = zip.index(zipcode) # Find the right row of msa/zip vectors
	except ValueError:
		ziprow = 1	# first row of zips file is not boston
	if re.search('ZCTA',line) and re.search('5-Digit',line) and re.search('1122', str(msa[ziprow])):  # Only want zipcodes
		zipvec.append(float(zipcode))
		mcdvec.append(float(mcd[ziprow]))
		logvec.append(float(line[18:25]))
		latvec.append(float(line[310:319]))
		lonvec.append(float(line[319:329]))
	else:
		pass
f.close

g = open('data/sorting/ma00006.uf3') #looking for data on income: p52, found in MA00006.uf3, should be records following 5 intro plus 65 - start at record 70 for all households 17 values: all households and 1 income: 
#Households
#Less than $10,000
#$10,000 to $14,999
#$15,000 to $19,999
#$20,000 to $24,999
#$25,000 to $29,999
#$30,000 to $34,999
#$35,000 to $39,999
#$40,000 to $44,999
#$45,000 to $49,999
#$50,000 to $59,999
#$60,000 to $74,999
#$75,000 to $99,999
#$100,000 to $124,999
#$125,000 to $149,999
#$150,000 to $199,999
#$200,000 or more

mcdvec2 = [] # New vectors merging logrecno,mcd,lon, and lat with income file ma0006.uf3
logvec2 = []
latvec2 = []
lonvec2 = []
incomepoor = []  # income of poorest 15 income groups - will be L x 15
incomerich = [] # income of just 16th group - will be L x 1
meanrich = [] # Average income for upper earners varies by jurisdiction divide element [90] by 87 [86] - former is total income of highest category
householdvec = [] # Will be L x 1 vector of all households (inverted so it's a weighting factor)
for line in g:
	line = line.split(',')
	logrecno = int(line[4])   # Find the row number of this logrecno in the geo file.  Delete the few that don't have any income
	try:
		incomerow=[]
		allzero = 0
		for i in range(71,86): # Through record 86 
			incomerow.append(float(line[i]))
			allzero=allzero + float(line[i]) # add to zero income test
		if allzero > 0:
			incomepoor.append(incomerow)  # 15 element row of income
			incomerich.append(float(line[86]))
			finder = logvec.index(logrecno)
			logvec2.append(logvec[finder])
			latvec2.append(latvec[finder])
			lonvec2.append(lonvec[finder])
			mcdvec2.append(mcdvec[finder])
			if float(line[86])>0:
				meanrich.append(log(float(line[90]))/float(line[86])) # ln income
			else:
				meanrich.append(200000)
			householdvec.append(1/float(line[70]))
	except ValueError:
		pass
g.close
print mean(r_[latvec2])
print mean(lonvec2)
print median(latvec2)
print min(latvec)

latvec2 = r_[latvec2]
lonvec2 = r_[lonvec2]
# Loop to get 6 neighbors and their different jurisdiction status
neighbors = []	# list of names will be 6Xnobs
samejur = []
for i in range(1,7):
	neighbors.append([])
	samejur.append([])

#Make the matrix of distances from neighbors and are they different jurisdictions
#Less efficient than doing distances and incomes at the same time, but easier to follow program
for i in range(0,len(logvec2)):
	distancevec = r_[(latvec2-latvec2[i])**2 + (lonvec2-lonvec2[i])**2] # Euclidean distance
	distancelist = [] 
	for row in range(0,len(distancevec)): #Convert distance into a python list
		distancelist.append(distancevec[row])
	sorted = sort(distancelist)
	for rank in range(0,6):	# This is n1-n6 
		ndist = sorted[rank+1]	# Needs to skip 0
		nrow = distancelist.index(ndist)
		neighbors[rank].append(logvec2[nrow])
		samejurtest = (mcdvec2[nrow]==mcdvec2[i]) # Test for same jurisdiction or not
		samejur[rank].append(samejurtest)
print mean(samejur)

# Now make the matrix of income variations
# First need 6 neighbors income matrices - each has 16 vectors - 1 for each of 16 income levels
neighborincomepoor = []
for i in range(0,6):
	neighborincomepoor.append([])
	for j in range(0,16):
		neighborincomepoor[i].append([])

for i in range(0,len(logvec2)):  #Income and longitude vectors *should* have the same length
	for rank in range(0,6):
		neighborcolumn = neighbors[rank]
		nrow = logvec2.index(neighborcolumn[i])
		neighborincomepoor[rank].append(incomepoor[nrow])

# Calculate MSA variance of income
# First calculate mean and variance within each zipcode
lnincomevec = outerproduct(ones(len(meanrich)), r_[log(5000), log(12500), log(17500), log(22500), log(27950), log(32500), log(37500), log(42500), log(47500), log(55000), log(67500), log(87500), log(112500), log(137500), log(175000)]) # only have automated incomes for the first 15 - 16th category is for highest - calc by hand
print len(incomepoor)
print len(incomerich)
print len(meanrich)
print len(outerproduct(incomerich,meanrich))
totalincomevec = outerproduct(incomepoor, transpose(lnincomevec)) + outerproduct(incomerich,transpose(meanrich)) # L x 1
meanincomevec = outerproduct(diag(totalincomevec),householdvec)   # Want L x 1 ->  do LxL version of meanincome elements on diag, X L x 1)
print "Mean income is", mean(meanincome)
# Dot product loop is slow - this should be matrix multiplication
# income is 16 x
for i in range(0,len(income)):
	totalincome = dot(r_[incomepoor[i]],r_[lnincomevec,meanrich[i]])
	meanincome = totalincome/sum(income[i])
	errors = r_[income[i]]-meanincome
	squarederrors = power(errors,2)
	variancevec.append(sum(squarederrors)/(sum(income)-1))
print mean(variancevec)
		
# How long did it take?
print "This is how long the program took: ", time.time() - starttime



More information about the SciPy-user mailing list