[Scipy-tickets] [SciPy] #1582: APPCRASH error when trying to use Latent Semantic Indexing method with Gensim

SciPy Trac scipy-tickets@scipy....
Fri Jan 20 07:17:54 CST 2012

#1582: APPCRASH error when trying to use Latent Semantic Indexing method with Gensim
 Reporter:  Shahab  |       Owner:  somebody
     Type:  defect  |      Status:  new     
 Priority:  high    |   Milestone:  0.11.0  
Component:  Other   |     Version:  0.10.0  
 Keywords:          |  
Changes (by radim):

 * cc: radimrehurek@… (added)


 Hi guys, I came across this ticket when googling gensim. In my opinion
 (author of gensim), this is not a bug in scipy per se. In both Shahab's
 and inverseofverse's cases, there is a mismatch between feature ids in
 their corpus and in their dictionary. In scipy terms, this means that some
 `scipy.sparse` operations are done with matrices where the claimed
 dimensionality doesn't match the actual entries. For example,
 inverseofverse's dictionary contains 160,989 ids, but the corpus has
 161,233 ids -- so some of the corpus entries contain invalid ids >=
 160,989. I believe this is the real cause of the crash that happens later,
 inside `sparsetools.csc_matvecs`, when the sparse C routines access memory
 they shouldn't.
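
 A quick way to check for this kind of mismatch is sketched below. It
 assumes the corpus is an iterable of documents, each a list of
 `(feature_id, weight)` pairs, and that `num_terms` is the number of ids
 the dictionary claims. The function name `find_invalid_ids` and the toy
 data are hypothetical, for illustration only.

```python
def find_invalid_ids(corpus, num_terms):
    """Return the sorted feature ids in `corpus` that fall outside [0, num_terms)."""
    invalid = set()
    for document in corpus:
        for feature_id, _weight in document:
            # Any id outside the claimed range would index past the
            # declared dimensionality of the sparse matrix.
            if not 0 <= feature_id < num_terms:
                invalid.add(feature_id)
    return sorted(invalid)


# Toy data (made up): the dictionary claims 3 ids (0..2), but one document uses id 5.
corpus = [[(0, 1.0), (2, 2.0)], [(1, 1.0), (5, 3.0)]]
print(find_invalid_ids(corpus, num_terms=3))  # [5]
```

 An empty result means every id in the corpus fits the claimed
 dimensionality.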

 The fix is IMO:

 * mostly on the user side: make the dictionary match the corpus
 * and perhaps on the gensim side: warn the user if they try to supply
 such mismatched data.
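
 As a rough sketch of both points, the wrapper below drops corpus
 entries whose ids fall outside the dictionary and warns when it does
 so. This is only one possible way to make the two match (rebuilding the
 dictionary from the corpus is another), and it is not existing gensim
 code: `drop_out_of_range` is a hypothetical helper, and the corpus is
 again assumed to be an iterable of lists of `(feature_id, weight)`
 pairs.

```python
import warnings


def drop_out_of_range(corpus, num_terms):
    """Yield each document with ids outside [0, num_terms) removed.

    A warning is issued for every document that had entries dropped.
    """
    for doc_no, document in enumerate(corpus):
        kept = [(fid, w) for fid, w in document if 0 <= fid < num_terms]
        dropped = len(document) - len(kept)
        if dropped:
            warnings.warn(
                "document %d: dropped %d entries with ids outside [0, %d)"
                % (doc_no, dropped, num_terms)
            )
        yield kept


# Toy data (made up): id 7 is invalid for a dictionary with 3 terms.
cleaned = list(drop_out_of_range([[(0, 1.0), (7, 2.0)], [(1, 1.0)]], num_terms=3))
print(cleaned)  # [[(0, 1.0)], [(1, 1.0)]]
```

 Because it is a generator, the wrapper streams over the corpus one
 document at a time and never needs the whole corpus in memory.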

 I guess detecting this on the scipy side would cause a major slowdown, so
 that's not a good option. Also, if this is really the cause, then it has
 nothing to do with Windows; the crash will happen under any OS.

Ticket URL: <http://projects.scipy.org/scipy/ticket/1582#comment:6>
SciPy <http://www.scipy.org>
SciPy is open-source software for mathematics, science, and engineering.
