62fc3f7 | piskvorky | 28 June 2011, 19:31:19 UTC | Merge branch 'release-0.8.0' | 28 June 2011, 19:31:19 UTC |
e771488 | piskvorky | 28 June 2011, 19:27:22 UTC | improved final 0.8.0 documentation | 28 June 2011, 19:27:22 UTC |
b430c0b | piskvorky | 28 June 2011, 19:26:28 UTC | up version; preparing 0.8.0 final release | 28 June 2011, 19:26:28 UTC |
cc3c801 | piskvorky | 28 June 2011, 18:41:42 UTC | Merge branch 'develop' of github.com:piskvorky/gensim into develop | 28 June 2011, 18:41:42 UTC |
6ceb779 | piskvorky | 28 June 2011, 18:39:01 UTC | work around strange Pyro packaging (version numbers) * to be removed once the new Pyro (>=4.4) is integrated | 28 June 2011, 18:39:01 UTC |
2fd7f33 | piskvorky | 27 June 2011, 21:40:15 UTC | added alias any2utf8 for to_utf8 * and any2unicode for to_unicode | 27 June 2011, 21:40:15 UTC |
cacf66e | Radim Řehůřek | 27 June 2011, 11:06:07 UTC | Merge pull request #44 from dedan/develop fix the module import when linking to the git root instead of module | 27 June 2011, 11:06:07 UTC |
1a27732 | Stephan Gabler | 27 June 2011, 08:41:49 UTC | fix the module import when linking to the git root instead of module for some application I need to link to the gensim folder which is also the root of the repository. This script helps python to find the actual source code of the module and had to be changed because Radim moved the source within the repo | 27 June 2011, 08:41:49 UTC |
dd20e05 | piskvorky | 25 June 2011, 01:36:05 UTC | fixed one PEP8 orphan | 25 June 2011, 01:36:05 UTC |
f9560e5 | piskvorky | 22 June 2011, 16:22:06 UTC | added google analytics to gensim website | 22 June 2011, 16:22:06 UTC |
c9dd9d3 | piskvorky | 22 June 2011, 16:20:01 UTC | updated docs for chunks->chunksize rename | 22 June 2011, 16:20:01 UTC |
29fd2ae | piskvorky | 22 June 2011, 16:14:36 UTC | Merge branch 'develop' of github.com:piskvorky/gensim into develop Conflicts: gensim/test/test_models.py | 22 June 2011, 16:14:36 UTC |
81ef3b5 | Radim Řehůřek | 22 June 2011, 16:10:26 UTC | Merge pull request #40 from Dieterbe/develop Rename variable "chunks" to more sensible "chunksize" | 22 June 2011, 16:10:26 UTC |
1b3891b | Dieter Plaetinck | 22 June 2011, 15:54:25 UTC | Rename variable "chunks" to more sensible "chunksize" | 22 June 2011, 15:54:25 UTC |
917cd28 | piskvorky | 22 June 2011, 14:58:43 UTC | removed print_debug calls from the LSI unittest * was causing `invalid value in divide` warnings in numpy * see http://groups.google.com/group/gensim/browse_thread/thread/45c1c9efe91ce8d0 | 22 June 2011, 14:58:43 UTC |
947d1f9 | piskvorky | 19 June 2011, 23:39:33 UTC | Merge branch 'release-0.8.0rc1' | 19 June 2011, 23:39:33 UTC |
0e4ad96 | piskvorky | 19 June 2011, 23:36:32 UTC | put Download above TOC on title page | 19 June 2011, 23:36:32 UTC |
2471feb | piskvorky | 19 June 2011, 23:19:21 UTC | up version: 0.8.0rc1 | 19 June 2011, 23:20:17 UTC |
7c3e372 | piskvorky | 19 June 2011, 23:18:54 UTC | updated documentation for 0.8.0 release | 19 June 2011, 23:18:54 UTC |
63d61cc | piskvorky | 19 June 2011, 23:13:49 UTC | fixed length of (Sparse)MatrixSimilarity | 19 June 2011, 23:13:49 UTC |
0d1db95 | piskvorky | 18 June 2011, 10:17:28 UTC | improved doc strings | 18 June 2011, 10:17:28 UTC |
f40fd77 | piskvorky | 16 June 2011, 14:41:50 UTC | added mmap load/save to LsiModel | 16 June 2011, 14:41:50 UTC |
890dd7a | piskvorky | 16 June 2011, 14:08:13 UTC | Added chunking for lsi[corpus] transformation (about 3x faster) * before, lsi[corpus] was just syntactic sugar for (lsi[doc] for doc in corpus) * now, lsi[corpus] proceeds in chunks of documents (256 by default) and transforms each entire chunk at once * the reason is, transforming a chunk = matrix * matrix multiply, is faster than 256 single document transforms = matrix * vector multiplies (bc. of cache&co) | 16 June 2011, 14:08:13 UTC |
70cb2a5 | piskvorky | 15 June 2011, 01:46:19 UTC | Merge branch 'develop' of github.com:piskvorky/gensim into develop | 15 June 2011, 01:46:19 UTC |
885991a | piskvorky | 15 June 2011, 01:34:44 UTC | updated changelog and todo.txt for new release | 15 June 2011, 01:42:20 UTC |
73ed595 | piskvorky | 15 June 2011, 01:31:21 UTC | simplified dir structure: src/gensim/ -> gensim/ | 15 June 2011, 01:42:20 UTC |
58426c0 | piskvorky | 13 June 2011, 19:04:02 UTC | removed scipy 0.6 from supported versions | 15 June 2011, 01:33:09 UTC |
7ff5149 | piskvorky | 13 June 2011, 19:04:02 UTC | removed scipy 0.6 from supported versions | 13 June 2011, 19:04:02 UTC |
f90afc1 | piskvorky | 13 June 2011, 17:46:46 UTC | Merge branch 'sharding' into develop | 13 June 2011, 17:46:46 UTC |
e0932f8 | piskvorky | 13 June 2011, 15:08:07 UTC | added script for testing speed of Similarity | 13 June 2011, 15:08:47 UTC |
88f2a3b | piskvorky | 13 June 2011, 14:58:42 UTC | updated docs to reflect PEP8 changes * also fixed and updated several doc strings and comments, esp. docsim.py | 13 June 2011, 15:08:47 UTC |
482c73f | piskvorky | 13 June 2011, 13:27:41 UTC | added chunking to Similarity | 13 June 2011, 14:59:52 UTC |
4c5cf51 | piskvorky | 12 June 2011, 22:55:53 UTC | added unit tests for similarities * 1st working version of sharded Similarity | 13 June 2011, 12:45:22 UTC |
fe01d93 | piskvorky | 12 June 2011, 22:53:19 UTC | changed default LsiModel chunk size to 20k (was: 10k) | 12 June 2011, 22:53:19 UTC |
9bf3d05 | piskvorky | 09 June 2011, 18:05:04 UTC | removed threaded chunking * users reported problems and the speed gain was small... * now uses simple itertools.groupby to chunk again, like in 0.7.7 | 10 June 2011, 12:37:01 UTC |
8fca994 | piskvorky | 08 June 2011, 19:46:19 UTC | mmap'ed (Sparse)MatrixSimilarity save/load + renamed .corpus to .index | 09 June 2011, 09:59:55 UTC |
6e5ed94 | piskvorky | 07 June 2011, 23:51:24 UTC | added sharding to similarity index | 09 June 2011, 09:59:55 UTC |
b59eb47 | piskvorky | 07 June 2011, 13:21:18 UTC | re #10: PEP8-fied function/variable names * backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc. * documentation to be updated in a separate commit | 09 June 2011, 09:59:29 UTC |
f564aa2 | piskvorky | 07 June 2011, 13:21:18 UTC | * backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc. | 07 June 2011, 13:24:30 UTC |
fd8c32a | piskvorky | 07 June 2011, 10:56:59 UTC | moved dmlcz to the `examples` subdirectory | 07 June 2011, 13:24:30 UTC |
df6f3e5 | piskvorky | 06 June 2011, 10:26:02 UTC | deleted old unused SVD algos | 06 June 2011, 10:26:02 UTC |
5610002 | piskvorky | 01 June 2011, 20:25:42 UTC | turn off threading in chunking by default * users reported problems (gensim stalling indefinitely, some deadlock?), http://groups.google.com/group/gensim/browse_thread/thread/c834e0c61eb50548 | 01 June 2011, 20:25:42 UTC |
3e8ef71 | piskvorky | 25 May 2011, 17:52:13 UTC | simplified logic of vector/corpus overload in index[query] + added another speed test | 25 May 2011, 18:12:46 UTC |
81e933f | piskvorky | 25 May 2011, 17:50:29 UTC | unitVec returns scipy.sparse output for scipy.sparse input (was: returns dense numpy array) | 25 May 2011, 17:50:29 UTC |
1065626 | piskvorky | 21 May 2011, 19:43:22 UTC | added little memory optimization in scipy.sparse operations | 21 May 2011, 19:43:22 UTC |
628f30d | piskvorky | 20 May 2011, 13:54:47 UTC | improved documentation of utils fncs | 20 May 2011, 13:56:53 UTC |
d6974e9 | piskvorky | 20 May 2011, 13:17:16 UTC | allow unicode in filterWiki fnc (was: only utf8) | 20 May 2011, 13:46:02 UTC |
796f6b5 | piskvorky | 20 May 2011, 11:10:48 UTC | more efficient sparse matrix generation When the sparse properties (#documents, #terms, #non-zeroes) are known in advance, a much more efficient code path is taken. This is the case with MmCorpus, so pass a MmCorpus object to SparseSimilarityIndex whenever possible. Eligibility for the fast code path is determined by duck-typing, so any corpus supporting self.numDocs, self.numTerms and self.numElements will do (MmCorpus is one such example). | 20 May 2011, 13:20:42 UTC |
a02ad76 | piskvorky | 19 May 2011, 11:29:00 UTC | changed default dense chunks size to 256 in indexing (was: 100) * powers of 2 give the best performance, i guess due to better cache alignment | 19 May 2011, 15:40:42 UTC |
5eec39d | piskvorky | 19 May 2011, 11:28:04 UTC | fixed bug where scipy.sparse arrays cannot be sliced beyond their end (unlike plain lists or numpy arrays) | 19 May 2011, 11:28:04 UTC |
a8923b8 | piskvorky | 19 May 2011, 11:26:42 UTC | Dictionary.filterExtremes() keeps first 100k tokens by default (was: keeps all) | 19 May 2011, 11:26:42 UTC |
bb3e971 | piskvorky | 16 May 2011, 16:28:18 UTC | added 2 more tests to test/simspeed.py | 16 May 2011, 21:26:59 UTC |
739000a | piskvorky | 16 May 2011, 15:48:33 UTC | removed direct gemm calls from lsimodel (all numpy.dot now) | 16 May 2011, 15:48:33 UTC |
a6d1355 | piskvorky | 16 May 2011, 15:48:04 UTC | added `main` to test_lee unittest | 16 May 2011, 15:48:04 UTC |
15fc7bb | piskvorky | 14 May 2011, 10:52:22 UTC | added script for testing speed of similarity queries | 15 May 2011, 11:19:15 UTC |
a31eab9 | piskvorky | 14 May 2011, 10:51:57 UTC | added chunking to SparseMatrixSimilarity | 15 May 2011, 11:19:15 UTC |
f5ba1df | piskvorky | 13 May 2011, 01:52:45 UTC | chunked version of MatrixSimilarity | 15 May 2011, 11:19:15 UTC |
f4dc1d2 | piskvorky | 09 May 2011, 17:12:55 UTC | `lda.printTopic()` returns string (was: directly prints to log) It now acts the same as LsiModel. Printing to log is done via `printTopics()`, which calls `printTopic()` internally. | 15 May 2011, 11:19:14 UTC |
90dad8e | piskvorky | 06 May 2011, 19:11:10 UTC | added alias `stem` for `parsing.preprocessing.stem_text` | 15 May 2011, 11:19:14 UTC |
54e9083 | piskvorky | 15 May 2011, 10:54:12 UTC | replaced fortran-order arrays + scipy.linalg.*blas calls with plain numpy.dot on c-order arrays * tests on numpy (1.3.0rc2) show no difference anymore, and the code is cleaner * this means users need to have both numpy and scipy linked against an optimized BLAS lib such as ATLAS (was: must have scipy linked). I guess most people upgrade ATLAS/numpy/scipy at the same time, so it should make no difference. | 15 May 2011, 11:19:14 UTC |
7711cbd | Stephan Gabler | 28 April 2011, 17:12:56 UTC | allows to evaluate lsi models with lower dimensionality than originally trained It can be useful to train a model with e.g. 100 topics and then check how good the results would have been with only 10 topics. This can be done now by simply setting the model.numTopics variable to a lower level than before. | 15 May 2011, 11:19:14 UTC |
9d71766 | Stephan Gabler | 28 April 2011, 17:12:56 UTC | make the top level directory of the repo a python module In the top level of the gensim repo you have to put an __init__.py containing the following line: __path__ = './src/gensim' this points to the actual module. For me it is very useful because I want to record the commit hashes of the modules I am using for an experiment. basically it is the answer to the question: for our experiments we use a framework which records the version of all modules and if they are under version control also the hash of the commit. Usually I have gensim in my home directory and link it to the site-packages. But I link /Users/me/gensim/src/gensim and the .git directory is in /Users/me/gensim so this framework does not see that the module is under version control. Is there a way to link to /Users/me/gensim and somehow tell python that there is a module in ./src/gensim ? | 15 May 2011, 11:19:14 UTC |
dacc4f6 | piskvorky | 04 May 2011, 16:21:19 UTC | fixed threaded chunking over an empty corpus | 04 May 2011, 16:21:19 UTC |
b0e3f51 | piskvorky | 03 May 2011, 10:39:03 UTC | minor logging fixes | 03 May 2011, 10:39:03 UTC |
7c1b278 | piskvorky | 03 May 2011, 10:37:29 UTC | improved document chunking code in utils | 03 May 2011, 10:37:29 UTC |
c98dcfb | piskvorky | 13 April 2011, 13:17:16 UTC | removed (undocumented) dependencies on nose * switched test_lee.py from nose test to unittest (consistent with the rest of gensim) * added a matutils.triu_indices fnc for numpy < 1.4 * removed citation section from top-level README | 24 April 2011, 22:04:11 UTC |
33f0b5f | Stephan Gabler | 04 April 2011, 11:55:33 UTC | Automated test to reproduce the results of Lee et al. (2005) Lee et al. (2005) compares different models for semantic similarity and verifies the results with similarity judgements from humans. The main result is that semantic similarities modelled by LSA have a correlation of 0.6 with human similarity judgements. As a validation of the gensim implementation we reproduced the results of Lee et al. (2005) in this test. Many thanks to Michael D. Lee (michael.lee@adelaide.edu.au) who provided us with his corpus and similarity data. If you need to reference this dataset, please cite: Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society | 24 April 2011, 22:04:11 UTC |
b88225c | piskvorky | 22 April 2011, 08:37:04 UTC | recompute id2word after updating Dictionary with new docs `id2word` has "lazy" semantics (computed only when asked to), it acts like a cache. Previously the cache never got updated; now it's updated whenever the Dictionary changes and `id2word` is requested again. | 24 April 2011, 22:04:11 UTC |
eea0424 | piskvorky | 16 April 2011, 13:12:56 UTC | optimized memory in lda | 16 April 2011, 13:29:45 UTC |
98cbf17 | piskvorky | 16 April 2011, 09:46:01 UTC | fixed bz2/gzip input for MmCorpus * was broken by adding IndexedCorpus | 16 April 2011, 09:46:01 UTC |
8829701 | piskvorky | 04 April 2011, 19:01:40 UTC | Merge remote branch 'dedan/log_entropy_fix' into develop | 04 April 2011, 19:01:40 UTC |
03e1783 | Stephan Gabler | 04 April 2011, 13:58:05 UTC | Fix a bug in the log_entropy_model The previous implementation had the mistake that it divided by the context diversity of a term instead of the total number of documents. It happened because I read it from a paper where the notation was misleading. Of course it could not be like this because then all terms with a context diversity of 1 would lead to a division by zero. For reference of the normalization see: Pincombe, B. (2004). Comparison of human and LSA judgements of pairwise document similarities for a news corpus. dspace.dsto.defence.gov.au. also the tests are changed to fit the new implementation | 04 April 2011, 13:58:05 UTC |
aa1d9b4 | piskvorky | 03 April 2011, 21:19:34 UTC | set all logger levels to NOTSET There was a request for the logging level to be configurable from a single point, preferably with a single command. See http://groups.google.com/group/gensim/browse_thread/thread/ff363fb5f07b6d01# for a discussion of the solution (NOTSET = the default logger level). | 03 April 2011, 21:19:34 UTC |
48c422c | piskvorky | 03 April 2011, 20:47:05 UTC | added __version__ attribute | 03 April 2011, 20:47:05 UTC |
d3b07c8 | piskvorky | 29 March 2011, 15:07:37 UTC | cleanup of BleiCorpus code | 29 March 2011, 15:07:37 UTC |
945f0f5 | piskvorky | 29 March 2011, 11:27:40 UTC | removed trailing whitespace; see github wiki https://github.com/piskvorky/gensim/wiki | 29 March 2011, 11:27:40 UTC |
3576a9b | piskvorky | 29 March 2011, 08:43:48 UTC | fixed Dieter's name in changelog | 29 March 2011, 08:43:48 UTC |
1755380 | piskvorky | 27 March 2011, 15:01:23 UTC | Merge branch 'issue17' into develop | 27 March 2011, 15:01:23 UTC |
5dd853f | David Nemeskey | 26 March 2011, 11:44:10 UTC | added hierarchical logging to all modules | 27 March 2011, 14:52:07 UTC |
b03531a | piskvorky | 26 March 2011, 15:20:05 UTC | added test directory to MANIFEST.in, so it gets distributed with source.tgz | 26 March 2011, 15:20:05 UTC |
6cd34f6 | piskvorky | 26 March 2011, 12:44:44 UTC | Merge branch 'release-0.7.8' | 26 March 2011, 12:44:44 UTC |
ee20ef2 | piskvorky | 26 March 2011, 12:38:37 UTC | Merge branch 'rename_serialize' into develop | 26 March 2011, 12:38:37 UTC |
10ea200 | piskvorky | 26 March 2011, 12:26:29 UTC | checked and updated documentation for new release * added API ref for IndexedCorpus * checked examples from tutorials are functional * updated examples to use Dictionary directly as id2word | 26 March 2011, 12:36:47 UTC |
42c4b7f | piskvorky | 26 March 2011, 12:23:47 UTC | renamed `saveIndexedCorpus` method to `serialize` ...and promoted it to be the default when saving corpora that support serialization (=most of them). `saveCorpus` should not be called directly anymore, `serialize` calls it internally automatically. | 26 March 2011, 12:36:21 UTC |
95b1ec0 | piskvorky | 26 March 2011, 10:53:57 UTC | up version (to 0.7.8) | 26 March 2011, 10:53:57 UTC |
0084801 | piskvorky | 25 March 2011, 18:52:08 UTC | Merge branch 'issue13' into develop | 25 March 2011, 18:52:08 UTC |
9f1cf92 | piskvorky | 25 March 2011, 18:43:49 UTC | regenerated all HTML for new release | 25 March 2011, 18:43:49 UTC |
bd1fce8 | piskvorky | 24 March 2011, 13:59:29 UTC | added HTML documentation for TextCorpus | 24 March 2011, 13:59:29 UTC |
b98e63a | piskvorky | 24 March 2011, 13:38:49 UTC | updated tutorial with streamed corpus The corpus=plain Python list was confusing people, some were copy&pasting the code from the tutorial, loading the entire corpus into memory. Then they ran out of memory and reported errors... Now the tutorial explicitly mentions this and gives an example of corpus as an iterable. | 24 March 2011, 13:38:49 UTC |
d5719d7 | piskvorky | 24 March 2011, 12:51:26 UTC | renamed `Dictionary.rebuildDictionary()` to `compactify()` | 24 March 2011, 12:51:26 UTC |
f1227d7 | piskvorky | 17 March 2011, 08:04:13 UTC | fixed LogEntropy transform for unknown term ids | 17 March 2011, 08:04:13 UTC |
8e96373 | piskvorky | 16 March 2011, 11:30:38 UTC | Merge branch 'cleanupfiles' into develop | 16 March 2011, 11:30:38 UTC |
c85ac5b | piskvorky | 16 March 2011, 11:28:21 UTC | cleaned up test data files (now all in a special dir) | 16 March 2011, 11:28:21 UTC |
48be6e1 | piskvorky | 16 March 2011, 11:01:33 UTC | Merge branch 'removetfidf' into develop | 16 March 2011, 11:01:33 UTC |
19a4dee | piskvorky | 16 March 2011, 10:59:09 UTC | removed parsing.tfidf module | 16 March 2011, 10:59:09 UTC |
c0e2b73 | piskvorky | 16 March 2011, 09:33:48 UTC | Merge branch 'dedan' into develop Conflicts: .gitignore | 16 March 2011, 09:33:48 UTC |
5410fc2 | Stephan Gabler | 09 March 2011, 17:14:12 UTC | Added LogEntropy transformation model with tests and documentation | 15 March 2011, 13:27:25 UTC |
4bbbe82 | piskvorky | 14 March 2011, 17:20:17 UTC | fixed some comments | 14 March 2011, 17:20:17 UTC |
04a4e3b | piskvorky | 13 March 2011, 21:47:01 UTC | Fixed TextCorpus unittest I realized when the input is a stream (file-like object), we cannot pickle it. When input is a filename (string), or anything else as long as it's picklable, pickling works ok. | 13 March 2011, 21:47:01 UTC |
3412851 | piskvorky | 13 March 2011, 21:20:01 UTC | fixed comments; added forgotten bz2 wrapper | 13 March 2011, 21:20:01 UTC |