https://github.com/RaRe-Technologies/gensim

Revision Message Commit Date
90ac214 Implements loss and gradients with modified objective 01 November 2017, 10:25:50 UTC
f22d9b2 Prints average loss every few iterations instead of current loss 31 October 2017, 11:31:30 UTC
9c51609 Fixes typo in clip_vectors 30 October 2017, 17:38:50 UTC
0c57aa1 Merge branch 'poincare' into poincare_model 30 October 2017, 09:56:25 UTC
2a5a7fb Minor correction in clipping 30 October 2017, 09:47:20 UTC
3ea3730 Merge pull request #1643 from RaRe-Technologies/poincare_eval Evaluation of existing Poincaré embedding implementations 30 October 2017, 09:46:38 UTC
71f61d1 Adds batch-wise implementation of training and gradient computations 27 October 2017, 21:11:09 UTC
ba82d42 Simply sets nan gradients to zero instead of nan_to_num 27 October 2017, 13:08:22 UTC
7d68aae Only calls nan_to_num when gamma has at least one value equal to 1 27 October 2017, 13:00:40 UTC
3b2a383 Avoids creating copies of numpy vectors 27 October 2017, 10:51:46 UTC
e1ed24d Avoids doing some numpy computations twice 27 October 2017, 10:51:46 UTC
d439501 Compares computed gradients to autograd gradients every few iterations 27 October 2017, 10:51:38 UTC
d72cb10 Renames PoincareDistance to PoincareExample for clarity 27 October 2017, 06:14:52 UTC
2e9e31c Better messages while training 26 October 2017, 22:18:58 UTC
99a2270 Fixes error in gradient computation 26 October 2017, 22:09:42 UTC
3e28e8b Correct implementation of clipping of updated vectors 26 October 2017, 20:13:26 UTC
e286a0b Adds calculation of gradients for poincare model 26 October 2017, 20:13:26 UTC
1e6aee1 minor changes to batch poincare distance computation 26 October 2017, 20:13:26 UTC
b727523 batched gradient descent initial implementation 26 October 2017, 20:13:26 UTC
98f94a7 allows poincare dist function to be differentiable by autograd 26 October 2017, 20:13:26 UTC
6bd0d4b faster negative sampling, bugfix in vector updates 26 October 2017, 20:13:26 UTC
a804006 Initial implementation of training using autograd 26 October 2017, 20:13:26 UTC
6afdd22 Initial classes and loading data for poincare model 26 October 2017, 20:13:26 UTC
99089a5 Doesn't load all models into memory at once 26 October 2017, 19:31:18 UTC
1e7ddd8 Adds poincare nb requirements, moves imports to beginning 26 October 2017, 19:31:18 UTC
e80a834 Minor fixes to poincare eval notebook 26 October 2017, 19:31:18 UTC
17390ac Adds results of numpy poincare embeddings on link prediction, minor improvements 26 October 2017, 19:31:18 UTC
c07d582 Adds cleaner setup 26 October 2017, 19:31:18 UTC
7cbd6b9 Adds patch for external numpy implementation to repo 26 October 2017, 19:31:18 UTC
5d72642 Adds code for training and loading external numpy models, results 26 October 2017, 19:31:18 UTC
53fcf23 More readable results 26 October 2017, 19:31:18 UTC
d7f1840 Corrects implementation of MAP, updated results 26 October 2017, 19:31:18 UTC
0d593f0 Adds initial implementation of MAP and MAP scores 26 October 2017, 19:31:18 UTC
a415c65 Adds results for all models on link prediction 26 October 2017, 19:31:18 UTC
1f15aeb Minor fixes to poincare nb - change in variable name, relative path, misaligned header 26 October 2017, 19:31:18 UTC
c062814 Adds patch file to setup and all results 26 October 2017, 19:31:18 UTC
efd1fe0 Adds patch for C++ poincare implementation 26 October 2017, 19:31:18 UTC
c17846f Implements link prediction task for poincare and adds results 26 October 2017, 19:31:18 UTC
69b4d61 Adds setup and training steps to notebook, tabulated results 26 October 2017, 19:31:18 UTC
fd86c32 Adds complete optimized evaluation of lexical entailment to notebook 26 October 2017, 19:31:18 UTC
0a06fd5 Adds initial evaluation for lexical entailment on HyperLex 26 October 2017, 19:31:18 UTC
9a511a7 More efficient computation of mean rank for graph reconstruction 26 October 2017, 19:31:18 UTC
51bd7ab Adds initial poincare evaluation notebook 26 October 2017, 19:31:18 UTC
a068cbe Fix deprecation warnings for regex string literals. Fix #1646 (#1649) * Fix deprecation warnings for regex string literals. Fix #1646 Add raw flag before all Regex strings so Python 3 can stop complaining. * Fix two more occurrences of unescaped Regex strings 26 October 2017, 12:00:43 UTC
00192a8 Fix pagerank algorithm. Fix #805 (#1653) * added a regression test for summarization.keywords() * handled case with graph smaller than 3 nodes * removed TODO about complex eigenvectors * added more comments 26 October 2017, 11:08:40 UTC
b912203 Drop Win x32 support & add 'rolling builds' (#1652) * disable x32 builds * add rolling build 26 October 2017, 05:53:41 UTC
67d9634 Fix code/docstring style (#1650) * replace open->smart_open in annoy tutorial * style fixes for lda model diff * fix for #1390 * fix for #1423 * fix doc in Phrases 25 October 2017, 18:34:27 UTC
9481915 Improve error message for supervised fastText. Fix #1498 (#1645) 24 October 2017, 13:25:11 UTC
a5872fa Fix scoring function in Phrases. Fix #1533, #1635 (#1573) * initial commit of fixes in comments of #1423 * removed unnecessary space in logger * added support for custom Phrases scorers * fixed Phrases.__getitem__ to support pluggable scoring #1533 * travisCI style fixes * fixed __next__() to next() for python 3 compatibility * misc fixes * spacing fixes for style * custom scorer support in sklearn api * Phrases scikit interface tests for pluggable scoring * missing line breaks * style, clarity, and robustness fixes requested by @piskvorky * check in Phrases init to make sure scorer is pickleable * backwards scoring compatibility when loading a Phrases class * removal of pickle testing objects in Phrases init * switched to six for python 2/3 compatibility * fix docstring 24 October 2017, 12:22:54 UTC
7f23a2c Fix FastText inconsistent dtype. Fix #1637 (#1638) 24 October 2017, 11:22:14 UTC
8097cad Add configuration for flake8 to setup.cfg (#1636) 24 October 2017, 11:10:09 UTC
9266aba Fix test_filename_filtering test (#1647) CI tests fail with: ====================================================================== FAIL: test_filename_filtering (gensim.test.test_corpora.TestTextDirectoryCorpus) ---------------------------------------------------------------------- Traceback (most recent call last): File ".../lib/python3.6/site-packages/gensim/test/test_corpora.py", line 462, in test_filename_filtering self.assertEqual(expected, filenames) AssertionError: Lists differ: ['/tmp/tmp0j1tou_7/test1.log', '/tmp/tmp0j1tou_7/test2.log'] != ['/tmp/tmp0j1tou_7/test2.log', '/tmp/tmp0j1tou_7/test1.log'] It's not a real failure, since the files are correct, only their order of comparison is not same 24 October 2017, 06:27:38 UTC
58b30d7 Add "DOI badge" to README page. Fix #1610 (#1639) * Add "DOI badge" to gensim #1610 * reorder badges 21 October 2017, 10:59:18 UTC
047ab12 Remove duplicate notebook. Fix #1415 (#1640) 21 October 2017, 10:54:16 UTC
e92b45d Add build_vocab_from_freq to Word2Vec, speedup scan_vocab (#1599) * fix build vocab speed issue, and new function to build vocab from previously provided word frequencies table * fix build vocab speed issue, function build vocab from previously provided word frequencies table * Removing the extra blank lines, documentation in numpy-style to build_vocab_from_freq, and hanging indents in build_vocab * Fixing Indentation * Fixing gensim/models/word2vec.py:697:1: W293 blank line contains whitespace * Remove trailing white spaces * Adding test * fix spaces 19 October 2017, 06:37:41 UTC
1a1fc44 Fix duplication and wrong markup in docs (#1633) * Fixed build of docs: - duplication of the citates from word2vec and doc2vec, - wrong markup of lists in the scripts, - some typos. * Add missing 'tensor' word 18 October 2017, 07:05:40 UTC
2690289 Add "most_similar_to_given" method for KeyedVectors (#1582) * finished adding 2 new functions * imported argmax to word2vec * reformatted * remove `most_similar_to_given` from w2v class * Fix PEP8 17 October 2017, 05:51:34 UTC
1c7e72f Refactor dendrogram & topic network notebooks (#1571) * remove plotly's dendrogram code in notebooks * pin plotly, re-run notebooks 16 October 2017, 11:18:40 UTC
e9bbcf3 Remove unnecessary assert blocking direct usage of CSC for LSI (#1622) 16 October 2017, 07:01:46 UTC
9166de2 Fix release badge (#1631) 16 October 2017, 06:38:00 UTC
16b812c Add dtype support for LSI (#1620) * Enable float32 for LSI - stochastic SVD * Fix PEP8 issue * - Add testTransformFloat32 - fix float32 for one-pass LSI 13 October 2017, 07:35:59 UTC
44b0403 Add __getitem__ method to Sparse2Corpus to allow direct queries (#1621) * Add __getitem__ method to Sparse2Corpus to allow direct queries * Fix PEP8 * Add docstring for Sparse2Corpus.__getitem__ 13 October 2017, 06:19:39 UTC
9a6d78c Merge branch 'master' into develop 12 October 2017, 08:45:34 UTC
86e0618 Merge branch 'release-3.0.1' 12 October 2017, 08:44:33 UTC
1c26225 update changelog to 3.0.1 12 October 2017, 08:43:38 UTC
90e9a43 bump version to 3.0.1 12 October 2017, 05:53:28 UTC
b0f80a6 Fix spelling (#1625) 11 October 2017, 18:24:07 UTC
c220166 Fix Keras import, speedup importing time. Fix #1614 (#1615) * Move Keras import to get_embedding_layer. * rename `get_embedding_layer ` as `get_keras_embedding` 06 October 2017, 10:26:06 UTC
0ef1ece Fix sphinx warnings and retrieve all missing .rst (#1612) * Fix typo * Make `save_corpus` private * Annotate `bleicorpus.py` * Make __save_corpus weakly private * Fix _save_corpus in tests * Fix _save_corpus[2] * Fix relativly obvious sphinx warnings * Fix sphinx warnings * Revert "Fix sphinx warnings" This reverts commit 8c00de8fc4a09bc9e3597eb6d8a95363543634f8. * Revert "Fix relativly obvious sphinx warnings" This reverts commit 7fbdf5db550be685d9b498a4da6a83e3c1e9b23d. * Revert "Fix _save_corpus[2]" This reverts commit b65a69a4b0313a7670b620a28411478ed8715cca. * Revert "Fix _save_corpus in tests" This reverts commit 69fc7e04a1c82cc7b72be231bbd3df207f50fe0b. * Revert "Make __save_corpus weakly private" This reverts commit 342811371b368315786ac8097a90e6612bba9e45. * Revert "Annotate `bleicorpus.py`" This reverts commit 981ebbbbabcf95ae7e2629266bcfb7d9931b7694. * Revert "Make `save_corpus` private" This reverts commit 36d98d11eb464ed74f7e6c22b45adbec7e5618e0. * Revert "Fix typo" This reverts commit b260d4b07114b1c449292cda492a0842b19445ce. * Revert "Revert "Fix relativly obvious sphinx warnings"" This reverts commit b4dddb3ca491a6d18ff470437e05d36adcd0c185. * Revert "Revert "Fix sphinx warnings"" This reverts commit ca3d216844b4818d74cf4be1e9878f006eb957c4. * fix PEP8 * fix last sphinx warnings * add missing submodules to reference * add missing *.rst * fix new warnings * add [docs] deps for building + remove [wmd] * add doc build to travis * fix PEP8 05 October 2017, 11:37:31 UTC
96d230a Fix logger message in lsi_dispatcher (#1603) * Fix logger message typo in lsi_dispatcher * small fix 02 October 2017, 08:58:39 UTC
36a5cb9 Merge branch 'release-3.0.0' into develop 27 September 2017, 08:58:31 UTC
351bdef Merge branch 'release-3.0.0' 27 September 2017, 08:57:25 UTC
af646c4 update changelog to 3.0.0 27 September 2017, 08:51:30 UTC
aab74b7 regenerated C files with Cython 27 September 2017, 08:31:23 UTC
c9d1e88 bump version to 3.0.0 27 September 2017, 08:30:11 UTC
0a2c05d Fix typo in translation_matrix notebook (#1598) 26 September 2017, 14:46:23 UTC
33a3ef2 Fix Translation Matrix (#1594) * fix the comments * remove print function * update the notebook * fix the train method * remove some words for sample * fix the tense * add warning for the translation matrix revist part 25 September 2017, 07:18:00 UTC
09fddf5 correct PathLineSentences comment 20 September 2017, 18:02:21 UTC
6e51156 Add unsupervised FastText to Gensim (#1525) * added initial code for CBOW * updated unit tests for fasttext * corrected use of matrix and precomputed ngrams for vocab words * added EOS token in 'LineSentence' class * added skipgram training code * updated unit tests for fasttext * seeded 'np.random' with 'self.seed' * added test for persistence * updated seeding numpy obj * updated (unclean) fasttext code for review * updated fasttext tutorial notebook * added 'save' and 'load_fasttext_format' functions * updated unit tests for fasttext * cleaned main fasttext code * updated unittests * removed EOS token from LineSentence * fixed flake8 errors * [WIP] added online learning * added tests for online learning * flake8 fixes * refactored code to remove redundancy * reusing 'word_vec' from 'FastTextKeyedVectors' * flake8 fixes * split 'syn0_all' into 'syn0_vocab' and 'syn0_ngrams' * removed 'init_wv' param from Word2Vec * updated unittests * flake8 errors fixed * fixed oov word_vec * updated test_training unittest * Fix broken merge * useless change (need to re-run Appveyour) * Add skipIf for Appveyor x32 (avoid memory error) 19 September 2017, 08:17:54 UTC
5a49a79 Fix doctag unicode problem. Fix 1543 (#1544) * Fix doctag unicode * Add test for unicode doctags. * Fix doc2vec unicode title test. * Make the unicode tag cast less hidden. 19 September 2017, 05:07:59 UTC
2e58a1c Update WikiCorpus tokenization. Fix #1534 (#1537) * code to better handle tokenization Adding the ability to define: 1. Define min and max token length 2. Define min number of tokens for valid articles 3. Call a custom function to handle tokenization with the configured parameter on the class instance 4. Control if lowercase is desired * adding another test case adding a test case to check "lower" parameter with the custom tokenizer * cleaning up code * clean up code for formatting * cleaning up indentation * missing backtick 18 September 2017, 15:17:09 UTC
02ba343 Add verification when summarize_corpus returns null. Fix #1531. (#1570) * Avoid "NoneType is not iterable..." error for few documents in corpus. * Fix comment. * Adding relevant test. * Fixed return types on summarization border cases: - Returns empty list on border case of summarize_corpus. - Returns empty string or empty list on border case of summarize. - Fixed test accordingly. - Removed some test code repetition. * Replace `is` to `==` 18 September 2017, 10:35:56 UTC
4c0737a Add word2vec-based coherence (#1530) * #1380: Initial implementation of coherence using word2vec similarity. * #1380: Add the `keyed_vectors` kwarg to the `CoherenceModel` to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means. * #1380: Add tests for `with_std` option for confirmation measures, and add test case to sanity check `word2vec_similarity`. * #1380: Add a `get_topics` method to all topic models, add test coverage for this, and update the `CoherenceModel` to use this for getting topics from models. * #1380: Require topics returned from `get_topics` to be probability distributions for the probabilistic topic models. * #1380: Clean up flake8 warnings. * #1380: Make `topn` a property so setting it to higher values will uncache the accumulator and the topics will be shrunk/expanded accordingly. * #1380: Pass through `with_std` argument for all coherence measures. * Update `test_coherencemodel` to skip Mallet and Vowpal Wabbit tests if the executables are not installed, instead of passing them inappropriately. * Fix trailing whitespace. * Add `get_topics` method to `BaseTopicModel` and update notebook for new Word2Vec-based coherence metric "c_w2v". * Add several helper methods to the `CoherenceModel` for comparing a set of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it. * fix flake8 whitespace issues * fix order of imports in `corpora.__init__` * fix corpora.__init__ import order * push fix for setting `topn` in `CoherenceModel.for_topics` * Use `dict.pop` in place of checking and optionally getting and deleting topn in `CoherenceModel.for_topics`. * fix non-deterministic test failure in `test_coherencemodel` * Update coherence model selection notebook to use sklearn dataset loader to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`. 18 September 2017, 08:36:29 UTC
6b8f1c0 Add comment explaining lack of multistream support (#1515) * Add comment explaining lack of multistream support See #1496, looks like this has confused some people. -POLM * Add file patterns to documentation 18 September 2017, 08:30:13 UTC
e667069 Fix incorrect initialization ShardedCorpus with a generator. Fix #1511 (#1512) Fix incorrect initialization ShardedCorpus with a generator. Fix #1511. 14 September 2017, 11:01:37 UTC
1c0098c Add TranslationMatrix model (for word2vec and paragraph2vec) (#1434) [MRG] Implement 'Translation Matrix' 13 September 2017, 12:38:25 UTC
224566c Improve speed of FastTextKeyedVectors.__contains__ (#1499) * Improve speed of FastTextKeyedVectors __contains__ The current implementation of __contains__ in FastTextKeyedVectors is `O(n*m)` where `n` is the number of character ngrams in the query word and `m` is the size of the vocabulary. This is very slow for large corpora. The new implementation is O(n). * any() was unnecessary. * Update variable name and docstring to improve clarity 11 September 2017, 10:01:35 UTC
db9e230 Refactor code with PEP8 and additional limitations. Fix #1521 (#1569) * Replace map(..) to comprehensions * Fix logging (remove '%'/'.format' + longer lines) * style-check[1] * Small fix for bash scripts * style-check[2] (corpora) * flake8 check * Fix shared_corpus API + resolve comment from review * Remove legacy "endclass" from corpora * style-check[3] * style-check[4] * Rename test_base_tm to basetmtest (for preventing direct running with nose) + small changes for models * style-check[5] * Replace LOG -> logger * Return broad exception to dictionary * Replace "dict((" -> dict comprehension * Replace "print(e)" -> "logger.exception(e)" * Fix quotation * Reduce long lines * missed PEP8 * style-check[6] 08 September 2017, 19:10:24 UTC
6d6f5dc Refactor all python code by PEP8. Partially fix #1521 (#1550) * gensim dir PEP8 fixes * corpora dir PEP8 fixes * example dir PEP8 fixes * model/wrapper dir PEP8 fixes * models dir PEP8 fixes * parsing dir PEP8 fixes * scripts dir PEP8 fixes * similarities dir PEP8 fixes * summarization and topic_coherence dir PEP8 fixes * test dir PEP8 fixes * PEP8 E722 error fixes * PEP8 fixes * list slice whitespace PEP8 fixes * disassemble import * * Fix symlink * fix symlink * fix make_wiki_lemma file * Replace relative import to absolute * fix typo * fix E203 error 05 September 2017, 14:58:07 UTC
13578d4 Add AppVeyor for all builds (#1565) * init run * rm dup * remove buggy test 05 September 2017, 10:40:34 UTC
32e0257 Fix mutable args in methods definition (#1562) * Change empty list args * Change empty dict args * Fix spaces 04 September 2017, 14:16:35 UTC
ed0b03e Add style-checking for notebooks & refactor Travis config (#1522) * add installation script for env * Add run script for test/codestyle * modify travis file * fix misprints * add pytest * Add basic version checking * try to fix FAST_VERSION==-1 * try to fix FAST_VERSION==-1[2] * remove debug info * fix flake8 problems with ignore list * continue with flake (break pep8 in matutils) * fix regexp for grep * restore matutils * echo files for flake8 * Add ipynb checking * special mistakes in .ipynb for testing purposes * Distinct file checking for ipynb * remove mistakes from notebooks 04 September 2017, 10:03:57 UTC
3d2227d Set trainable flag in get_embedding_layer. Fix 1557 (#1558) 01 September 2017, 15:22:54 UTC
9caf055 Fix Mallet wrapper and tests for HDPTransform (#1555) * fix type in mallet wrapper * fix tests for sklearn wrapper * debug commit for test * fix seeding and precision * fix pep8 & try to fix unreproducable error * debug unreproduced error * fix test * remove debug output 01 September 2017, 13:36:00 UTC
26b285e Add the Google Tag Manager (GTM) (#1556) * Update layout.html - removed old Google Analytics code - added two snippets for Google Tag Manager (GTM), one in head, the other in body * Update layout.html - removed old Google Analytics code (Urchin) - added code for Google Tag Manager - one in head, the other in body * Ignore *.html for flake8 01 September 2017, 09:28:03 UTC
26cd87a Update Doc2vec-IDMB notebook (#1476) * Added introduction, motivation, etc. and cleaned up Doc2vec-IDMB notebook * Fixed a syntax error 31 August 2017, 07:49:51 UTC
ae31c0c Add callback metrics interface for LdaModel and integration with Visdom (#1399) * save log params in a dict * remove redundant line * add diff log * remove diff log * write params to log directory * add convergence, remove alpha * calculate perplexity/diff instead of using log function * add docstrings and comments * add coherence/diff labels in graphs * optional measures for viz * add coherence params to lda init * added Lda Visom viz notebook * add option to specify env * made requested changes * add generic callback API * modified Notebook for new API * fix flake8 * correct lee corpus division * added docstrings * fix flake8 * add shell example * fix queue import for both py2/py3 * store metrics in model instance * add nb example for getting metrics after train * made rquested changes * use dict for saving metrics * use str method for metric classes * correct a notebook description * remove child-classes str method * made requested changes * add visdom screenshot 30 August 2017, 11:09:51 UTC
1a73e4f Add Capital One to Adopters page (#1552) 28 August 2017, 12:01:22 UTC
1764e69 Replace viewitems() to iteritems(). Fix 1495 (#1508) 25 August 2017, 11:04:39 UTC
5cefaef Remove extra filter_token from tutorial (#1502) 25 August 2017, 10:56:50 UTC
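
Note on 224566c (Improve speed of FastTextKeyedVectors.__contains__): the message describes going from O(n*m) to O(n), where n is the number of character n-grams in the query word and m is the vocabulary size. The sketch below is not gensim's code; it only illustrates, under assumptions, how a per-n-gram scan of a list differs from a hash lookup in a set. The helper names (`char_ngrams`, `contains_slow`, `contains_fast`), the `<`/`>` boundary markers, and the n-gram length range are all hypothetical choices made for this illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word`, wrapped in '<' and '>' markers.

    Hypothetical helper for illustration; parameter defaults are assumptions.
    """
    wrapped = "<" + word + ">"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]


def contains_slow(word, known_ngrams_list):
    # Each `in` on a list scans up to m entries, so the whole check
    # costs O(n * m) for n query n-grams.
    return any(ng in known_ngrams_list for ng in char_ngrams(word))


def contains_fast(word, known_ngrams_set):
    # Each `in` on a set is an average O(1) hash lookup, so the whole
    # check costs O(n) for n query n-grams.
    return any(ng in known_ngrams_set for ng in char_ngrams(word))


# Build a tiny "vocabulary" of n-grams from one known word.
known_list = char_ngrams("night")
known_set = set(known_list)

# "nightly" shares n-grams such as "<ni" and "igh" with "night".
print(contains_slow("nightly", known_list), contains_fast("nightly", known_set))
# "xyz" shares none.
print(contains_fast("xyz", known_set))
```

Both functions return the same answer; only the container used for the membership test changes, and that is where the difference in cost comes from.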