Revision history - None - origin: https://github.com/RaRe-Technologies/gensim

Revision	Author	Date	Message	Commit Date
62fc3f7	piskvorky	28 June 2011, 19:31:19 UTC	Merge branch 'release-0.8.0'	28 June 2011, 19:31:19 UTC
e771488	piskvorky	28 June 2011, 19:27:22 UTC	improved final 0.8.0 documentation	28 June 2011, 19:27:22 UTC
b430c0b	piskvorky	28 June 2011, 19:26:28 UTC	up version; preparing 0.8.0 final release	28 June 2011, 19:26:28 UTC
cc3c801	piskvorky	28 June 2011, 18:41:42 UTC	Merge branch 'develop' of github.com:piskvorky/gensim into develop	28 June 2011, 18:41:42 UTC
6ceb779	piskvorky	28 June 2011, 18:39:01 UTC	work around strange Pyro packaging (version numbers) * to be removed once the new Pyro (>=4.4) is integrated	28 June 2011, 18:39:01 UTC
2fd7f33	piskvorky	27 June 2011, 21:40:15 UTC	added alias any2utf8 for to_utf8 * and any2unicode for to_unicode	27 June 2011, 21:40:15 UTC
cacf66e	Radim Řehůřek	27 June 2011, 11:06:07 UTC	Merge pull request #44 from dedan/develop fix the module import when linking to the git root instead of module	27 June 2011, 11:06:07 UTC
1a27732	Stephan Gabler	27 June 2011, 08:41:49 UTC	fix the module import when linking to the git root instead of module for some application I need to link to the gensim folder which is also the root of the repository. This script helps python to find the actual sourcecode of the module and had to be changed because radim moved the source within the repo	27 June 2011, 08:41:49 UTC
dd20e05	piskvorky	25 June 2011, 01:36:05 UTC	fixed one PEP8 orphan	25 June 2011, 01:36:05 UTC
f9560e5	piskvorky	22 June 2011, 16:22:06 UTC	added google analytics to gensim website	22 June 2011, 16:22:06 UTC
c9dd9d3	piskvorky	22 June 2011, 16:20:01 UTC	updated docs for chunks->chunksize rename	22 June 2011, 16:20:01 UTC
29fd2ae	piskvorky	22 June 2011, 16:14:36 UTC	Merge branch 'develop' of github.com:piskvorky/gensim into develop Conflicts: gensim/test/test_models.py	22 June 2011, 16:14:36 UTC
81ef3b5	Radim Řehůřek	22 June 2011, 16:10:26 UTC	Merge pull request #40 from Dieterbe/develop Rename variable "chunks" to more sensible "chunksize"	22 June 2011, 16:10:26 UTC
1b3891b	Dieter Plaetinck	22 June 2011, 15:54:25 UTC	Rename variable "chunks" to more sensible "chunksize"	22 June 2011, 15:54:25 UTC
917cd28	piskvorky	22 June 2011, 14:58:43 UTC	removed print_debug calls from the LSI unittest * was causing `invalid value in divide` warnings in numpy * see http://groups.google.com/group/gensim/browse_thread/thread/45c1c9efe91ce8d0	22 June 2011, 14:58:43 UTC
947d1f9	piskvorky	19 June 2011, 23:39:33 UTC	Merge branch 'release-0.8.0rc1'	19 June 2011, 23:39:33 UTC
0e4ad96	piskvorky	19 June 2011, 23:36:32 UTC	put Download above TOC on title page	19 June 2011, 23:36:32 UTC
2471feb	piskvorky	19 June 2011, 23:19:21 UTC	up version: 0.8.0rc1	19 June 2011, 23:20:17 UTC
7c3e372	piskvorky	19 June 2011, 23:18:54 UTC	updated documentation for 0.8.0 release	19 June 2011, 23:18:54 UTC
63d61cc	piskvorky	19 June 2011, 23:13:49 UTC	fixed length of (Sparse)MatrixSimilarity	19 June 2011, 23:13:49 UTC
0d1db95	piskvorky	18 June 2011, 10:17:28 UTC	improved doc strings	18 June 2011, 10:17:28 UTC
f40fd77	piskvorky	16 June 2011, 14:41:50 UTC	added mmap load/save to LsiModel	16 June 2011, 14:41:50 UTC
890dd7a	piskvorky	16 June 2011, 14:08:13 UTC	Added chunking for lsi[corpus] transformation (about 3x faster) * before, lsi[corpus] was just syntactic sugar for (lsi[doc] for doc in corpus) * now, lsi[corpus] proceeds in chunks of documents (256 by default) and transforms each entire chunk at once * the reason is, transforming a chunk = matrix * matrix multiply, is faster than 256 single document transforms = matrix * vector multiplies (bc. of cache&co)	16 June 2011, 14:08:13 UTC
70cb2a5	piskvorky	15 June 2011, 01:46:19 UTC	Merge branch 'develop' of github.com:piskvorky/gensim into develop	15 June 2011, 01:46:19 UTC
885991a	piskvorky	15 June 2011, 01:34:44 UTC	updated changelog and todo.txt for new release	15 June 2011, 01:42:20 UTC
73ed595	piskvorky	15 June 2011, 01:31:21 UTC	simplified dir structure: src/gensim/ -> gensim/	15 June 2011, 01:42:20 UTC
58426c0	piskvorky	13 June 2011, 19:04:02 UTC	removed scipy 0.6 from supported versions	15 June 2011, 01:33:09 UTC
7ff5149	piskvorky	13 June 2011, 19:04:02 UTC	removed scipy 0.6 from supported versions	13 June 2011, 19:04:02 UTC
f90afc1	piskvorky	13 June 2011, 17:46:46 UTC	Merge branch 'sharding' into develop	13 June 2011, 17:46:46 UTC
e0932f8	piskvorky	13 June 2011, 15:08:07 UTC	added script for testing speed of Similarity	13 June 2011, 15:08:47 UTC
88f2a3b	piskvorky	13 June 2011, 14:58:42 UTC	updated docs to reflect PEP8 changes * also fixed and updated several doc strings and comments, esp. docsim.py	13 June 2011, 15:08:47 UTC
482c73f	piskvorky	13 June 2011, 13:27:41 UTC	added chunking to Similarity	13 June 2011, 14:59:52 UTC
4c5cf51	piskvorky	12 June 2011, 22:55:53 UTC	added unit tests for similarities * 1st working version of sharded Similarity	13 June 2011, 12:45:22 UTC
fe01d93	piskvorky	12 June 2011, 22:53:19 UTC	changed default LsiModel chunk size to 20k (was: 10k)	12 June 2011, 22:53:19 UTC
9bf3d05	piskvorky	09 June 2011, 18:05:04 UTC	removed threaded chunking * users reported problems and the speed gain was small... * now uses simple itertools.groupby to chunk again, like in 0.7.7	10 June 2011, 12:37:01 UTC
8fca994	piskvorky	08 June 2011, 19:46:19 UTC	mmap'ed (Sparse)MatrixSimilarity save/load + renamed .corpus to .index	09 June 2011, 09:59:55 UTC
6e5ed94	piskvorky	07 June 2011, 23:51:24 UTC	added sharding to similarity index	09 June 2011, 09:59:55 UTC
b59eb47	piskvorky	07 June 2011, 13:21:18 UTC	re #10: PEP8-fied function/variable names * backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc. * documentation to be updated in a separate commit	09 June 2011, 09:59:29 UTC
f564aa2	piskvorky	07 June 2011, 13:21:18 UTC	* backwards incompatible, breaks all existing code! * but the changes are straightforward: numTopics => num_topics, addDocuments => add_documents etc.	07 June 2011, 13:24:30 UTC
fd8c32a	piskvorky	07 June 2011, 10:56:59 UTC	moved dmlcz to the `examples` subdirectory	07 June 2011, 13:24:30 UTC
df6f3e5	piskvorky	06 June 2011, 10:26:02 UTC	deleted old unused SVD algos	06 June 2011, 10:26:02 UTC
5610002	piskvorky	01 June 2011, 20:25:42 UTC	turn off threading in chunking by default * users reported problems (gensim stalling indefinitely, some deadlock?), http://groups.google.com/group/gensim/browse_thread/thread/c834e0c61eb50548	01 June 2011, 20:25:42 UTC
3e8ef71	piskvorky	25 May 2011, 17:52:13 UTC	simplified logic of vector/corpus overload in index[query] + added another speed test	25 May 2011, 18:12:46 UTC
81e933f	piskvorky	25 May 2011, 17:50:29 UTC	unitVec returns scipy.sparse output for scipy.sparse input (was: returns dense numpy array)	25 May 2011, 17:50:29 UTC
1065626	piskvorky	21 May 2011, 19:43:22 UTC	added little memory optimization in scipy.sparse operations	21 May 2011, 19:43:22 UTC
628f30d	piskvorky	20 May 2011, 13:54:47 UTC	improved documentation of utils fncs	20 May 2011, 13:56:53 UTC
d6974e9	piskvorky	20 May 2011, 13:17:16 UTC	allow unicode in filterWiki fnc (was: only utf8)	20 May 2011, 13:46:02 UTC
796f6b5	piskvorky	20 May 2011, 11:10:48 UTC	more efficient sparse matrix generation When the sparse properties (#documents, #terms, #non-zeroes) are known in advance, a much more efficient code path is taken. This is the case with MmCorpus, so pass a MmCorpus object to SparseSimilarityIndex whenever possible. Eligibility for the fast code path is determined by duck-typing, so any corpus supporting self.numDocs, self.numTerms and self.numElements will do (MmCorpus is one such example).	20 May 2011, 13:20:42 UTC
a02ad76	piskvorky	19 May 2011, 11:29:00 UTC	changed default dense chunks size to 256 in indexing (was: 100) * powers of 2 give the best performance, i guess due to better cache alignment	19 May 2011, 15:40:42 UTC
5eec39d	piskvorky	19 May 2011, 11:28:04 UTC	fixed bug where scipy.sparse arrays cannot be sliced beyond their end (unlike plain lists or numpy arrays)	19 May 2011, 11:28:04 UTC
a8923b8	piskvorky	19 May 2011, 11:26:42 UTC	Dictionary.filterExtremes() keeps first 100k tokens by default (was: keeps all)	19 May 2011, 11:26:42 UTC
bb3e971	piskvorky	16 May 2011, 16:28:18 UTC	added 2 more tests to test/simspeed.py	16 May 2011, 21:26:59 UTC
739000a	piskvorky	16 May 2011, 15:48:33 UTC	removed direct gemm calls from lsimodel (all numpy.dot now)	16 May 2011, 15:48:33 UTC
a6d1355	piskvorky	16 May 2011, 15:48:04 UTC	added `main` to test_lee unittest	16 May 2011, 15:48:04 UTC
15fc7bb	piskvorky	14 May 2011, 10:52:22 UTC	added script for testing speed of similarity queries	15 May 2011, 11:19:15 UTC
a31eab9	piskvorky	14 May 2011, 10:51:57 UTC	added chunking to SparseMatrixSimilarity	15 May 2011, 11:19:15 UTC
f5ba1df	piskvorky	13 May 2011, 01:52:45 UTC	chunked version of MatrixSimilarity	15 May 2011, 11:19:15 UTC
f4dc1d2	piskvorky	09 May 2011, 17:12:55 UTC	`lda.printTopic()` returns string (was: directly prints to log) It now acts the same as LsiModel. Printing to log is done via `printTopics()`, which calls `printTopic()` internally.	15 May 2011, 11:19:14 UTC
90dad8e	piskvorky	06 May 2011, 19:11:10 UTC	added alias `stem` for `parsing.preprocessing.stem_text`	15 May 2011, 11:19:14 UTC
54e9083	piskvorky	15 May 2011, 10:54:12 UTC	replaced fortran-order arrays + scipy.linalg.blas calls with plain numpy.dot on c-order arrays tests on numpy (1.3.0rc2) show no difference anymore, and the code is cleaner * this means users need to have both numpy and scipy linked against an optimized BLAS lib such as ATLAS (was: must have scipy linked). I guess most people upgrade ATLAS/numpy/scipy at the same time, so it should make no difference.	15 May 2011, 11:19:14 UTC
7711cbd	Stephan Gabler	28 April 2011, 17:12:56 UTC	allows to evaluate lsi models with lower dimensionality than originally trained It can be useful to train a model with e.g. 100 topics and then check how good the results would have been with only 10 topics. This can be done now by simply setting the model.numTopics variable to a lower level than before.	15 May 2011, 11:19:14 UTC
9d71766	Stephan Gabler	28 April 2011, 17:09:51 UTC	make the top level directory of the repo a python module In the top level of the gensim repo you have to put an __init__.py containing the following line: __path__ = './src/gensim' this points to the actual module. For me it is very useful because I want to record the commit hashes of the modules I am using for an experiment. basically it is the answer to the question: for our experiments we use a framework which records the version of all modules and if they are under version control also the hash of the commit. Usually I have gensim in my home directory and link it to the site-packages. But I link /Users/me/gensim/src/gensim and the .git directory is in /Users/me/gensim so this framework does not see that the module is under version control. Is there a way to link to /Users/me/gensim and somehow tell python that there is a module in ./src/gensim ?	15 May 2011, 11:19:14 UTC
dacc4f6	piskvorky	04 May 2011, 16:21:19 UTC	fixed threaded chunking over an empty corpus	04 May 2011, 16:21:19 UTC
b0e3f51	piskvorky	03 May 2011, 10:39:03 UTC	minor logging fixes	03 May 2011, 10:39:03 UTC
7c1b278	piskvorky	03 May 2011, 10:37:29 UTC	improved document chunking code in utils	03 May 2011, 10:37:29 UTC
c98dcfb	piskvorky	13 April 2011, 13:17:16 UTC	removed (undocumented) dependencies on nose * switched test_lee.py from nose test to unittest (consistent with the rest of gensim) * added a matutils.triu_indices fnc for numpy < 1.4 * removed citation section from top-level README	24 April 2011, 22:04:11 UTC
33f0b5f	Stephan Gabler	04 April 2011, 11:55:33 UTC	Automated test to reproduce the results of Lee et al. (2005) Lee et al. (2005) compares different models for semantic similarity and verifies the results with similarity judgements from humans. The main result is that semantic similarities modelled by LSA have a correlation of 0.6 with human similarity judgements. As a validation of the gensim implementation we reproduced the results of Lee et al. (2005) in this test. Many thanks to Michael D. Lee (michael.lee@adelaide.edu.au) who provided us with his corpus and similarity data. If you need to reference this dataset, please cite: Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society	24 April 2011, 22:04:11 UTC
b88225c	piskvorky	22 April 2011, 08:37:04 UTC	recompute id2word after updating Dictionary with new docs `id2word` has "lazy" semantics (computed only when asked to), it acts like a cache. Previously the cache never got updated; now it's updated whenever the Dictionary changes and `id2word` is requested again.	24 April 2011, 22:04:11 UTC
eea0424	piskvorky	16 April 2011, 13:12:56 UTC	optimized memory in lda	16 April 2011, 13:29:45 UTC
98cbf17	piskvorky	16 April 2011, 09:46:01 UTC	fixed bz2/gzip input for MmCorpus * was broken by adding IndexedCorpus	16 April 2011, 09:46:01 UTC
8829701	piskvorky	04 April 2011, 19:01:40 UTC	Merge remote branch 'dedan/log_entropy_fix' into develop	04 April 2011, 19:01:40 UTC
03e1783	Stephan Gabler	04 April 2011, 13:58:05 UTC	Fix a bug in the log_entropy_model The previous implementation had the mistake that it divided by the context diversity of a term instead of the total number of documents. It happened because I read it from a paper where the notation was misleading. Of course it could not be like this because then all terms with a context diversity of 1 would lead to a division by zero. For reference of the normalization see: Pincombe, B. (2004). Comparison of human and LSA judgements of pairwise document similarities for a news corpus. dspace.dsto.defence.gov.au. also the tests are changed to fit the new implementation	04 April 2011, 13:58:05 UTC
aa1d9b4	piskvorky	03 April 2011, 21:19:34 UTC	set all logger levels to NOTSET There was a request for the logging level to be configurable from a single point, preferably with a single command. See http://groups.google.com/group/gensim/browse_thread/thread/ff363fb5f07b6d01# for a discussion of the solution (NOTSET = the default logger level).	03 April 2011, 21:19:34 UTC
48c422c	piskvorky	03 April 2011, 20:47:05 UTC	added __version__ attribute	03 April 2011, 20:47:05 UTC
d3b07c8	piskvorky	29 March 2011, 15:07:37 UTC	cleanup of BleiCorpus code	29 March 2011, 15:07:37 UTC
945f0f5	piskvorky	29 March 2011, 11:27:40 UTC	removed trailing whitespace; see github wiki https://github.com/piskvorky/gensim/wiki	29 March 2011, 11:27:40 UTC
3576a9b	piskvorky	29 March 2011, 08:43:48 UTC	fixed Dieter's name in changelog	29 March 2011, 08:43:48 UTC
1755380	piskvorky	27 March 2011, 15:01:23 UTC	Merge branch 'issue17' into develop	27 March 2011, 15:01:23 UTC
5dd853f	David Nemeskey	26 March 2011, 11:44:10 UTC	added hierarchical logging to all modules	27 March 2011, 14:52:07 UTC
b03531a	piskvorky	26 March 2011, 15:20:05 UTC	added test directory to MANIFEST.in, so it gets distributed with source.tgz	26 March 2011, 15:20:05 UTC
6cd34f6	piskvorky	26 March 2011, 12:44:44 UTC	Merge branch 'release-0.7.8'	26 March 2011, 12:44:44 UTC
ee20ef2	piskvorky	26 March 2011, 12:38:37 UTC	Merge branch 'rename_serialize' into develop	26 March 2011, 12:38:37 UTC
10ea200	piskvorky	26 March 2011, 12:26:29 UTC	checked and updated documentation for new release * added API ref for IndexedCorpus * checked examples from tutorials are functional * updated examples to use Dictionary directly as id2word	26 March 2011, 12:36:47 UTC
42c4b7f	piskvorky	26 March 2011, 12:23:47 UTC	renamed `saveIndexedCorpus` method to `serialize` ...and promoted it to be the default when saving corpora that support serialization (=most of them). `saveCorpus` should not be called directly anymore, `serialize` calls it internally automatically.	26 March 2011, 12:36:21 UTC
95b1ec0	piskvorky	26 March 2011, 10:53:57 UTC	up version (to 0.7.8)	26 March 2011, 10:53:57 UTC
0084801	piskvorky	25 March 2011, 18:52:08 UTC	Merge branch 'issue13' into develop	25 March 2011, 18:52:08 UTC
9f1cf92	piskvorky	25 March 2011, 18:43:49 UTC	regenerated all HTML for new release	25 March 2011, 18:43:49 UTC
bd1fce8	piskvorky	24 March 2011, 13:59:29 UTC	added HTML documentation for TextCorpus	24 March 2011, 13:59:29 UTC
b98e63a	piskvorky	24 March 2011, 13:38:49 UTC	updated tutorial with streamed corpus The corpus=plain Python list was confusing people, some were copy&pasting the code from the tutorial, loading the entire corpus into memory. Then they ran out of memory and reported errors... Now the tutorial explicitly mentions this and gives an example of corpus as an iterable.	24 March 2011, 13:38:49 UTC
d5719d7	piskvorky	24 March 2011, 12:51:26 UTC	renamed `Dictionary.rebuildDictionary()` to `compactify()`	24 March 2011, 12:51:26 UTC
f1227d7	piskvorky	17 March 2011, 08:04:13 UTC	fixed LogEntropy transform for unknown term ids	17 March 2011, 08:04:13 UTC
8e96373	piskvorky	16 March 2011, 11:30:38 UTC	Merge branch 'cleanupfiles' into develop	16 March 2011, 11:30:38 UTC
c85ac5b	piskvorky	16 March 2011, 11:28:21 UTC	cleaned up test data files (now all in a special dir)	16 March 2011, 11:28:21 UTC
48be6e1	piskvorky	16 March 2011, 11:01:33 UTC	Merge branch 'removetfidf' into develop	16 March 2011, 11:01:33 UTC
19a4dee	piskvorky	16 March 2011, 10:59:09 UTC	removed parsing.tfidf module	16 March 2011, 10:59:09 UTC
c0e2b73	piskvorky	16 March 2011, 09:33:48 UTC	Merge branch 'dedan' into develop Conflicts: .gitignore	16 March 2011, 09:33:48 UTC
5410fc2	Stephan Gabler	09 March 2011, 17:14:12 UTC	Added LogEntropy transformation model with tests and documentation	15 March 2011, 13:27:25 UTC
4bbbe82	piskvorky	14 March 2011, 17:20:17 UTC	fixed some comments	14 March 2011, 17:20:17 UTC
04a4e3b	piskvorky	13 March 2011, 21:47:01 UTC	Fixed TextCorpus unittest I realized when the input is a stream (file-like object), we cannot pickle it. When input is a filename (string), or anything else as long as it's picklable, pickling works ok.	13 March 2011, 21:47:01 UTC
3412851	piskvorky	13 March 2011, 21:20:01 UTC	fixed comments; added forgotten bz2 wrapper	13 March 2011, 21:20:01 UTC
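
Several entries above (890dd7a, 9bf3d05, 1b3891b) concern splitting a streamed corpus into fixed-size chunks, and 9bf3d05 mentions doing so with itertools.groupby. The following is a minimal sketch of that general technique, not the gensim source: the function name `chunk_documents` and its signature are hypothetical, and only the default of 256 is taken from the 890dd7a message.

```python
import itertools


def chunk_documents(corpus, chunksize=256):
    """Yield lists of up to `chunksize` consecutive documents.

    Hypothetical helper for illustration; not part of the gensim API.
    `corpus` may be any iterable, including a one-pass stream.
    """
    # Number the documents as they stream by. Documents whose index has
    # the same value of (index // chunksize) fall into the same chunk,
    # so groupby on that key splits the stream into consecutive runs.
    numbered = enumerate(corpus)
    for _, group in itertools.groupby(numbered, key=lambda pair: pair[0] // chunksize):
        # Drop the running index and keep only the documents.
        yield [doc for _, doc in group]
```

Because groupby consumes its input lazily, only one chunk is held in memory at a time, which suits the streamed corpora described in b98e63a. A caller would iterate as `for chunk in chunk_documents(corpus, 256): ...` and process each chunk as a whole.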
