https://github.com/RaRe-Technologies/gensim

Revision Author Date Message Commit Date
62fc3f7 Merge branch 'release-0.8.0' 28 June 2011, 19:31:19 UTC
e771488 improved final 0.8.0 documentation 28 June 2011, 19:27:22 UTC
b430c0b up version; preparing 0.8.0 final release 28 June 2011, 19:26:28 UTC
cc3c801 Merge branch 'develop' of github.com:piskvorky/gensim into develop 28 June 2011, 18:41:42 UTC
6ceb779 work around strange Pyro packaging (version numbers) * to be removed once the new Pyro (>=4.4) is integrated 28 June 2011, 18:39:01 UTC
2fd7f33 added alias any2utf8 for to_utf8 * and any2unicode for to_unicode 27 June 2011, 21:40:15 UTC
cacf66e Merge pull request #44 from dedan/develop fix the module import when linking to the git root instead of module 27 June 2011, 11:06:07 UTC
1a27732 fix the module import when linking to the git root instead of module. For some application I need to link to the gensim folder, which is also the root of the repository. This script helps Python find the actual source code of the module and had to be changed because Radim moved the source within the repo 27 June 2011, 08:41:49 UTC
dd20e05 fixed one PEP8 orphan 25 June 2011, 01:36:05 UTC
f9560e5 added google analytics to gensim website 22 June 2011, 16:22:06 UTC
c9dd9d3 updated docs for chunks->chunksize rename 22 June 2011, 16:20:01 UTC
29fd2ae Merge branch 'develop' of github.com:piskvorky/gensim into develop Conflicts: gensim/test/test_models.py 22 June 2011, 16:14:36 UTC
81ef3b5 Merge pull request #40 from Dieterbe/develop Rename variable "chunks" to more sensible "chunksize" 22 June 2011, 16:10:26 UTC
1b3891b Rename variable "chunks" to more sensible "chunksize" 22 June 2011, 15:54:25 UTC
917cd28 removed print_debug calls from the LSI unittest * was causing `invalid value in divide` warnings in numpy * see http://groups.google.com/group/gensim/browse_thread/thread/45c1c9efe91ce8d0 22 June 2011, 14:58:43 UTC
947d1f9 Merge branch 'release-0.8.0rc1' 19 June 2011, 23:39:33 UTC
0e4ad96 put Download above TOC on title page 19 June 2011, 23:36:32 UTC
2471feb up version: 0.8.0rc1 19 June 2011, 23:20:17 UTC
7c3e372 updated documentation for 0.8.0 release 19 June 2011, 23:18:54 UTC
63d61cc fixed length of (Sparse)MatrixSimilarity 19 June 2011, 23:13:49 UTC
0d1db95 improved doc strings 18 June 2011, 10:17:28 UTC
f40fd77 added mmap load/save to LsiModel 16 June 2011, 14:41:50 UTC
890dd7a Added chunking for lsi[corpus] transformation (about 3x faster) * before, lsi[corpus] was just syntactic sugar for (lsi[doc] for doc in corpus) * now, lsi[corpus] proceeds in chunks of documents (256 by default) and transforms each entire chunk at once * the reason is, transforming a chunk = matrix * matrix multiply, is faster than 256 single document transforms = matrix * vector multiplies (bc. of cache&co) 16 June 2011, 14:08:13 UTC
70cb2a5 Merge branch 'develop' of github.com:piskvorky/gensim into develop 15 June 2011, 01:46:19 UTC
885991a updated changelog and todo.txt for new release 15 June 2011, 01:42:20 UTC
73ed595 simplified dir structure: src/gensim/ -> gensim/ 15 June 2011, 01:42:20 UTC
58426c0 removed scipy 0.6 from supported versions 15 June 2011, 01:33:09 UTC
7ff5149 removed scipy 0.6 from supported versions 13 June 2011, 19:04:02 UTC
f90afc1 Merge branch 'sharding' into develop 13 June 2011, 17:46:46 UTC
e0932f8 added script for testing speed of Similarity 13 June 2011, 15:08:47 UTC
88f2a3b updated docs to reflect PEP8 changes * also fixed and updated several doc strings and comments, esp. docsim.py 13 June 2011, 15:08:47 UTC
482c73f added chunking to Similarity 13 June 2011, 14:59:52 UTC
4c5cf51 added unit tests for similarities * 1st working version of sharded Similarity 13 June 2011, 12:45:22 UTC
fe01d93 changed default LsiModel chunk size to 20k (was: 10k) 12 June 2011, 22:53:19 UTC
9bf3d05 removed threaded chunking * users reported problems and the speed gain was small... * now uses simple itertools.groupby to chunk again, like in 0.7.7 10 June 2011, 12:37:01 UTC
8fca994 mmap'ed (Sparse)MatrixSimilarity save/load + renamed .corpus to .index 09 June 2011, 09:59:55 UTC
6e5ed94 added sharding to similarity index 09 June 2011, 09:59:55 UTC
b59eb47 re #10: PEP8-fied function/variable names * backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc. * documentation to be updated in a separate commit 09 June 2011, 09:59:29 UTC
f564aa2 * backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc. 07 June 2011, 13:24:30 UTC
fd8c32a moved dmlcz to the `examples` subdirectory 07 June 2011, 13:24:30 UTC
df6f3e5 deleted old unused SVD algos 06 June 2011, 10:26:02 UTC
5610002 turn off threading in chunking by default * users reported problems (gensim stalling indefinitely, some deadlock?), http://groups.google.com/group/gensim/browse_thread/thread/c834e0c61eb50548 01 June 2011, 20:25:42 UTC
3e8ef71 simplified logic of vector/corpus overload in index[query] + added another speed test 25 May 2011, 18:12:46 UTC
81e933f unitVec returns scipy.sparse output for scipy.sparse input (was: returns dense numpy array) 25 May 2011, 17:50:29 UTC
1065626 added little memory optimization in scipy.sparse operations 21 May 2011, 19:43:22 UTC
628f30d improved documentation of utils fncs 20 May 2011, 13:56:53 UTC
d6974e9 allow unicode in filterWiki fnc (was: only utf8) 20 May 2011, 13:46:02 UTC
796f6b5 more efficient sparse matrix generation When the sparse properties (#documents, #terms, #non-zeroes) are known in advance, a much more efficient code path is taken. This is the case with MmCorpus, so pass a MmCorpus object to SparseSimilarityIndex whenever possible. Eligibility for the fast code path is determined by duck-typing, so any corpus supporting self.numDocs, self.numTerms and self.numElements will do (MmCorpus is one such example). 20 May 2011, 13:20:42 UTC
a02ad76 changed default dense chunks size to 256 in indexing (was: 100) * powers of 2 give the best performance, i guess due to better cache alignment 19 May 2011, 15:40:42 UTC
5eec39d fixed bug where scipy.sparse arrays cannot be sliced beyond their end (unlike plain lists or numpy arrays) 19 May 2011, 11:28:04 UTC
a8923b8 Dictionary.filterExtremes() keeps first 100k tokens by default (was: keeps all) 19 May 2011, 11:26:42 UTC
bb3e971 added 2 more tests to test/simspeed.py 16 May 2011, 21:26:59 UTC
739000a removed direct gemm calls from lsimodel (all numpy.dot now) 16 May 2011, 15:48:33 UTC
a6d1355 added `main` to test_lee unittest 16 May 2011, 15:48:04 UTC
15fc7bb added script for testing speed of similarity queries 15 May 2011, 11:19:15 UTC
a31eab9 added chunking to SparseMatrixSimilarity 15 May 2011, 11:19:15 UTC
f5ba1df chunked version of MatrixSimilarity 15 May 2011, 11:19:15 UTC
f4dc1d2 `lda.printTopic()` returns string (was: directly prints to log) It now acts the same as LsiModel. Printing to log is done via `printTopics()`, which calls `printTopic()` internally. 15 May 2011, 11:19:14 UTC
90dad8e added alias `stem` for `parsing.preprocessing.stem_text` 15 May 2011, 11:19:14 UTC
54e9083 replaced fortran-order arrays + scipy.linalg.*blas calls with plain numpy.dot on c-order arrays * tests on numpy (1.3.0rc2) show no difference anymore, and the code is cleaner * this means users need to have both numpy and scipy linked against an optimized BLAS lib such as ATLAS (was: must have scipy linked). I guess most people upgrade ATLAS/numpy/scipy at the same time, so it should make no difference. 15 May 2011, 11:19:14 UTC
7711cbd allow evaluating lsi models with lower dimensionality than originally trained It can be useful to train a model with e.g. 100 topics and then check how good the results would have been with only 10 topics. This can be done now by simply setting the model.numTopics variable to a lower value than before. 15 May 2011, 11:19:14 UTC
9d71766 make the top level directory of the repo a python module In the top level of the gensim repo you have to put an __init__.py containing the following line: __path__ = './src/gensim' This points to the actual module. For me it is very useful because I want to record the commit hashes of the modules I am using for an experiment. Basically it is the answer to the question: for our experiments we use a framework which records the version of all modules and, if they are under version control, also the hash of the commit. Usually I have gensim in my home directory and link it to the site-packages. But I link /Users/me/gensim/src/gensim and the .git directory is in /Users/me/gensim, so this framework does not see that the module is under version control. Is there a way to link to /Users/me/gensim and somehow tell python that there is a module in ./src/gensim ? 15 May 2011, 11:19:14 UTC
dacc4f6 fixed threaded chunking over an empty corpus 04 May 2011, 16:21:19 UTC
b0e3f51 minor logging fixes 03 May 2011, 10:39:03 UTC
7c1b278 improved document chunking code in utils 03 May 2011, 10:37:29 UTC
c98dcfb removed (undocumented) dependencies on nose * switched test_lee.py from nose test to unittest (consistent with the rest of gensim) * added a matutils.triu_indices fnc for numpy < 1.4 * removed citation section from top-level README 24 April 2011, 22:04:11 UTC
33f0b5f Automated test to reproduce the results of Lee et al. (2005) Lee et al. (2005) compares different models for semantic similarity and verifies the results with similarity judgements from humans. The main result is that semantic similarities modelled by LSA have a correlation of 0.6 with human similarity judgements. As a validation of the gensim implementation we reproduced the results of Lee et al. (2005) in this test. Many thanks to Michael D. Lee (michael.lee@adelaide.edu.au) who provided us with his corpus and similarity data. If you need to reference this dataset, please cite: Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society 24 April 2011, 22:04:11 UTC
b88225c recompute id2word after updating Dictionary with new docs `id2word` has "lazy" semantics (computed only when asked to), it acts like a cache. Previously the cache never got updated; now it's updated whenever the Dictionary changes and `id2word` is requested again. 24 April 2011, 22:04:11 UTC
eea0424 optimized memory in lda 16 April 2011, 13:29:45 UTC
98cbf17 fixed bz2/gzip input for MmCorpus * was broken by adding IndexedCorpus 16 April 2011, 09:46:01 UTC
8829701 Merge remote branch 'dedan/log_entropy_fix' into develop 04 April 2011, 19:01:40 UTC
03e1783 Fix a bug in the log_entropy_model The previous implementation had the mistake that it divided by the context diversity of a term instead of the total number of documents. It happened because I read it from a paper where the notation was misleading. Of course it could not be like this because then all terms with a context diversity of 1 would lead to a division by zero. For reference of the normalization see: Pincombe, B. (2004). Comparison of human and LSA judgements of pairwise document similarities for a news corpus. dspace.dsto.defence.gov.au. The tests are also changed to fit the new implementation 04 April 2011, 13:58:05 UTC
aa1d9b4 set all logger levels to NOTSET There was a request for the logging level to be configurable from a single point, preferably with a single command. See http://groups.google.com/group/gensim/browse_thread/thread/ff363fb5f07b6d01# for a discussion of the solution (NOTSET = the default logger level). 03 April 2011, 21:19:34 UTC
48c422c added __version__ attribute 03 April 2011, 20:47:05 UTC
d3b07c8 cleanup of BleiCorpus code 29 March 2011, 15:07:37 UTC
945f0f5 removed trailing whitespace; see github wiki https://github.com/piskvorky/gensim/wiki 29 March 2011, 11:27:40 UTC
3576a9b fixed Dieter's name in changelog 29 March 2011, 08:43:48 UTC
1755380 Merge branch 'issue17' into develop 27 March 2011, 15:01:23 UTC
5dd853f added hierarchical logging to all modules 27 March 2011, 14:52:07 UTC
b03531a added test directory to MANIFEST.in, so it gets distributed with source.tgz 26 March 2011, 15:20:05 UTC
6cd34f6 Merge branch 'release-0.7.8' 26 March 2011, 12:44:44 UTC
ee20ef2 Merge branch 'rename_serialize' into develop 26 March 2011, 12:38:37 UTC
10ea200 checked and updated documentation for new release * added API ref for IndexedCorpus * checked examples from tutorials are functional * updated examples to use Dictionary directly as id2word 26 March 2011, 12:36:47 UTC
42c4b7f renamed `saveIndexedCorpus` method to `serialize` ...and promoted it to be the default when saving corpora that support serialization (=most of them). `saveCorpus` should not be called directly anymore, `serialize` calls it internally automatically. 26 March 2011, 12:36:21 UTC
95b1ec0 up version (to 0.7.8) 26 March 2011, 10:53:57 UTC
0084801 Merge branch 'issue13' into develop 25 March 2011, 18:52:08 UTC
9f1cf92 regenerated all HTML for new release 25 March 2011, 18:43:49 UTC
bd1fce8 added HTML documentation for TextCorpus 24 March 2011, 13:59:29 UTC
b98e63a updated tutorial with streamed corpus The corpus=plain Python list was confusing people, some were copy&pasting the code from the tutorial, loading the entire corpus into memory. Then they ran out of memory and reported errors... Now the tutorial explicitly mentions this and gives an example of corpus as an iterable. 24 March 2011, 13:38:49 UTC
d5719d7 renamed `Dictionary.rebuildDictionary()` to `compactify()` 24 March 2011, 12:51:26 UTC
f1227d7 fixed LogEntropy transform for unknown term ids 17 March 2011, 08:04:13 UTC
8e96373 Merge branch 'cleanupfiles' into develop 16 March 2011, 11:30:38 UTC
c85ac5b cleaned up test data files (now all in a special dir) 16 March 2011, 11:28:21 UTC
48be6e1 Merge branch 'removetfidf' into develop 16 March 2011, 11:01:33 UTC
19a4dee removed parsing.tfidf module 16 March 2011, 10:59:09 UTC
c0e2b73 Merge branch 'dedan' into develop Conflicts: .gitignore 16 March 2011, 09:33:48 UTC
5410fc2 Added LogEntropy transformation model with tests and documentation 15 March 2011, 13:27:25 UTC
4bbbe82 fixed some comments 14 March 2011, 17:20:17 UTC
04a4e3b Fixed TextCorpus unittest I realized when the input is a stream (file-like object), we cannot pickle it. When input is a filename (string), or anything else as long as it's picklable, pickling works ok. 13 March 2011, 21:47:01 UTC
3412851 fixed comments; added forgotten bz2 wrapper 13 March 2011, 21:20:01 UTC
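
Several commits above (890dd7a, 9bf3d05, 1b3891b) concern processing a streamed corpus in fixed-size chunks, with 9bf3d05 mentioning a simple itertools.groupby approach. A minimal sketch of that idea follows. The function name `chunkize`, its signature, and the sample data are illustrative assumptions, not gensim's code at these revisions.

```python
import itertools


def chunkize(corpus, chunksize=256):
    """Yield lists of at most `chunksize` documents from any iterable.

    Hypothetical helper for illustration; not gensim's implementation.
    """
    # Number the documents, then group consecutive ones by chunk index
    # (doc_no // chunksize). Only one chunk is held in memory at a time.
    numbered = enumerate(corpus)
    for _, group in itertools.groupby(numbered, key=lambda pair: pair[0] // chunksize):
        yield [doc for _, doc in group]


# A streamed corpus (generator) of 600 made-up bag-of-words documents.
docs = ([(i, 1.0)] for i in range(600))
sizes = [len(chunk) for chunk in chunkize(docs)]
```

Because the input is consumed lazily, this works for corpora that do not fit in memory. Each yielded chunk could then be transformed at once, which is the matrix * matrix case that 890dd7a reports as faster than per-document transforms.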