[SciPy-user] mid-size outerproduct too big for memory
Thomas Davidoff
davidoff at haas.berkeley.edu
Sun Apr 24 22:51:17 CDT 2005
Thanks for the help.
The code (attached) is going to reek of poor programming skills but I
think the dimensions on line 158 (where the program stops) are
conformable. I can't attach the datasets - they're too big. If you are truly
curious, the .uf3 files are at census.gov under the SF3 FTP page.
As for the other processes, I don't have much of anything else going
on. Running top on my remote Mac gives:
Processes: 55 total, 2 running, 53 sleeping... 137 threads
20:48:38
Load Avg: 0.00, 0.00, 0.00 CPU usage: 1.3% user, 4.8% sys, 93.8% idle
SharedLibs: num = 134, resident = 60.2M code, 4.59M data, 18.3M LinkEdit
MemRegions: num = 12317, resident = 124M + 10.5M private, 89.8M shared
PhysMem: 233M wired, 163M active, 207M inactive, 604M used, 1.91G free
VM: 5.08G + 95.1M 259046(0) pageins, 11945(0) pageouts
Robert Kern wrote:
> Thomas Davidoff wrote:
>
>> I tried to get the outerproduct(a,b) where a is a 23,297 by 15 matrix
>> and b is a 15 element row vector. This led to a memory error. This
>> is a pretty small dataset for empirical economics - is there a way
>> around this?
>
>
> 23297*15*15*8./1024/1024 = 40
>
> 40 MB. My G4's 1 GB of memory handles this fine. I'm not sure why you are
> seeing a memory error. Could you post your code?
>
>> Will I ever be able to do something like 1 million by 100 or is it
>> back to Stata for me?
>
>
> 1000000*100*100*8/1024/1024/1024 == 74
>
> 74 GB. No, I don't think your machine will handle that outerproduct well.
>
>> I don't think the issue is other stuff in the program because nothing
>> should be eating a lot of memory.
>
>
> How about other stuff running on the system?
>
>> I noticed I get the same problem in command line python just trying a
>> 100,000 by 100 matrix and a 1 by 100 row vector.
>> I am on a new G5 PowerMac with lots of new memory installed.
>
>
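For scale, the arithmetic above is easy to check directly. A minimal sketch in numpy (np.outer, like the old scipy outerproduct, ravels its inputs to 1-D before forming the product, which is why a 23,297 x 15 argument behaves like a 349,455-element vector):

```python
import numpy as np

a = np.ones((23297, 15))  # the 23,297 x 15 matrix from the post
b = np.ones(15)           # the 15-element row vector

# outer() ravels `a` first, so the result is (23297*15) x 15:
# 23297*15*15*8 bytes, i.e. about 40 MB of float64.
ab = np.outer(a, b)
print(ab.shape, ab.nbytes / 2.0**20)

# The proposed 1,000,000 x 100 case would need (1e6*100) x 100 float64s:
gb = 1000000 * 100 * 100 * 8 / 2.0**30
print(gb)  # roughly 74 GB -- too big to materialize

# If the intent was to scale each row of `a` by `b` elementwise,
# broadcasting does that with no blow-up at all:
scaled = a * b
print(scaled.shape)
```

Whether broadcasting is the right replacement depends on what the outer product is used for downstream; the sketch only illustrates the size arithmetic.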
--
Thomas Davidoff
Assistant Professor
Haas School of Business
UC Berkeley
Berkeley, CA 94618
Phone: (510) 643-1425
Fax: (510) 643-7357
email: davidoff at haas.berkeley.edu
web: http://faculty.haas.berkeley.edu/davidoff/
-------------- next part --------------
# jueround2.py
# Python program to do sorting paper work
# Tom Davidoff
# 4/20/2005
# Start date: Wed Apr 20 16:17:34 PDT 2005
# How long?
import time
starttime = time.time()
#Packages
from scipy import *
from rpy import *
import re
# Make empty vectors of logrecno for zipcode, lat and lon
logvec = []
latvec = []
lonvec = []
zipvec = []
mcdvec = [] # MCD is minor civil division
f = open('data/sorting/mageo.uf3','r') # Census 2000 geographic file
# Ensure only zips in Boston MSA - use census matching file
zipfile = open('data/sorting/zcta_to_99msa_rel.txt', 'r')
msa = []
mcd = [] # for knowing what town
zip = []
for line in zipfile:
    line = line.split("\t")
    if re.search('1122', line[0]) and len(line[4]) > 0:
        msa.append(float(line[0]))
        mcd.append(line[4])
        try:
            zip.append(float(line[5]))
        except ValueError:
            pass
for line in f:
    try:
        zipcode = float(line[160:165])
        ziprow = zip.index(zipcode) # Find the right row of msa/zip vectors
    except ValueError:
        ziprow = 1 # first row of zips file is not boston
    if re.search('ZCTA', line) and re.search('5-Digit', line) and re.search('1122', str(msa[ziprow])): # Only want zipcodes
        zipvec.append(float(zipcode))
        mcdvec.append(float(mcd[ziprow]))
        logvec.append(float(line[18:25]))
        latvec.append(float(line[310:319]))
        lonvec.append(float(line[319:329]))
    else:
        pass
f.close()
g = open('data/sorting/ma00006.uf3') # looking for data on income: P52, found in MA00006.uf3, should be records following 5 intro plus 65 - start at record 70 for all households. 17 values: total households and 16 income categories:
#Households
#Less than $10,000
#$10,000 to $14,999
#$15,000 to $19,999
#$20,000 to $24,999
#$25,000 to $29,999
#$30,000 to $34,999
#$35,000 to $39,999
#$40,000 to $44,999
#$45,000 to $49,999
#$50,000 to $59,999
#$60,000 to $74,999
#$75,000 to $99,999
#$100,000 to $124,999
#$125,000 to $149,999
#$150,000 to $199,999
#$200,000 or more
mcdvec2 = [] # New vectors merging logrecno,mcd,lon, and lat with income file ma0006.uf3
logvec2 = []
latvec2 = []
lonvec2 = []
incomepoor = [] # income of poorest 15 income groups - will be L x 15
incomerich = [] # income of just 16th group - will be L x 1
meanrich = [] # Average income for upper earners varies by jurisdiction: divide element [90] by element [86] - former is total income of highest category
householdvec = [] # Will be L x 1 vector of all households (inverted so it's a weighting factor)
for line in g:
    line = line.split(',')
    logrecno = int(line[4]) # Find the row number of this logrecno in the geo file. Delete the few that don't have any income
    try:
        incomerow = []
        allzero = 0
        for i in range(71, 86): # Through record 86
            incomerow.append(float(line[i]))
            allzero = allzero + float(line[i]) # add to zero income test
        if allzero > 0:
            incomepoor.append(incomerow) # 15 element row of income
            incomerich.append(float(line[86]))
            finder = logvec.index(logrecno)
            logvec2.append(logvec[finder])
            latvec2.append(latvec[finder])
            lonvec2.append(lonvec[finder])
            mcdvec2.append(mcdvec[finder])
            if float(line[86]) > 0:
                meanrich.append(log(float(line[90]))/float(line[86])) # ln income
            else:
                meanrich.append(200000)
            householdvec.append(1/float(line[70]))
    except ValueError:
        pass
g.close()
print mean(r_[latvec2])
print mean(lonvec2)
print median(latvec2)
print min(latvec)
latvec2 = r_[latvec2]
lonvec2 = r_[lonvec2]
# Loop to get 6 neighbors and their different jurisdiction status
neighbors = [] # list of names will be 6Xnobs
samejur = []
for i in range(1, 7):
    neighbors.append([])
    samejur.append([])
#Make the matrix of distances from neighbors and are they different jurisdictions
#Less efficient than doing distances and incomes at the same time, but easier to follow program
for i in range(0, len(logvec2)):
    distancevec = r_[(latvec2-latvec2[i])**2 + (lonvec2-lonvec2[i])**2] # Euclidean distance
    distancelist = []
    for row in range(0, len(distancevec)): # Convert distance into a python list
        distancelist.append(distancevec[row])
    sorted = sort(distancelist)
    for rank in range(0, 6): # This is n1-n6
        ndist = sorted[rank+1] # Needs to skip 0
        nrow = distancelist.index(ndist)
        neighbors[rank].append(logvec2[nrow])
        samejurtest = (mcdvec2[nrow] == mcdvec2[i]) # Test for same jurisdiction or not
        samejur[rank].append(samejurtest)
print mean(samejur)
# Now make the matrix of income variations
# First need 6 neighbors income matrices - each has 16 vectors - 1 for each of 16 income levels
neighborincomepoor = []
for i in range(0, 6):
    neighborincomepoor.append([])
    for j in range(0, 16):
        neighborincomepoor[i].append([])
for i in range(0, len(logvec2)): # Income and longitude vectors *should* have the same length
    for rank in range(0, 6):
        neighborcolumn = neighbors[rank]
        nrow = logvec2.index(neighborcolumn[i])
        neighborincomepoor[rank].append(incomepoor[nrow])
# Calculate MSA variance of income
# First calculate mean and variance within each zipcode
lnincomevec = outerproduct(ones(len(meanrich)), r_[log(5000), log(12500), log(17500), log(22500), log(27500), log(32500), log(37500), log(42500), log(47500), log(55000), log(67500), log(87500), log(112500), log(137500), log(175000)]) # only have automated incomes for the first 15 - 16th category is for highest - calc by hand
print len(incomepoor)
print len(incomerich)
print len(meanrich)
print len(outerproduct(incomerich,meanrich))
totalincomevec = outerproduct(incomepoor, transpose(lnincomevec)) + outerproduct(incomerich,transpose(meanrich)) # L x 1
meanincomevec = outerproduct(diag(totalincomevec),householdvec) # Want L x 1 -> do LxL version of meanincome elements on diag, X L x 1)
print "Mean income is", mean(meanincomevec)
# Dot product loop is slow - this should be matrix multiplication
# income is L x 16
variancevec = []
for i in range(0, len(income)):
    totalincome = dot(r_[incomepoor[i]], r_[lnincomevec, meanrich[i]])
    meanincome = totalincome/sum(income[i])
    errors = r_[income[i]] - meanincome
    squarederrors = power(errors, 2)
    variancevec.append(sum(squarederrors)/(sum(income)-1))
print mean(variancevec)
# How long did it take?
print "This is how long the program took: ", time.time() - starttime
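As the comment near the end of the script notes, the slow dot-product loop can be replaced by a single matrix multiplication. A minimal sketch with illustrative names (counts for the L x 15 household counts per income bin, lnmid for the 15 log bin midpoints -- neither name is from the script), assuming the goal is a per-tract weighted mean and variance of log income:

```python
import numpy as np

L = 4  # small illustrative number of tracts
lnmid = np.log([5000., 12500., 17500., 22500., 27500., 32500., 37500.,
                42500., 47500., 55000., 67500., 87500., 112500., 137500.,
                175000.])
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(L, 15)).astype(float)  # fake bin counts

households = counts.sum(axis=1)           # total households per tract
meaninc = counts.dot(lnmid) / households  # weighted mean: one matmul, no loop
dev = lnmid - meaninc[:, None]            # deviations, broadcast to L x 15
variance = (counts * dev**2).sum(axis=1) / (households - 1.0)
```

The matmul does all L per-row dot products at once, so for 23,297 tracts it replaces 23,297 iterations of the Python loop with one vectorized call.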
More information about the SciPy-user mailing list