10e792a | Xavier Grangier | 09 May 2013, 14:41:58 UTC | get parser class from config | 09 May 2013, 14:41:58 UTC |
b03cf04 | Xavier Grangier | 09 May 2013, 13:34:41 UTC | move lxml stuff to parser class | 09 May 2013, 13:34:41 UTC |
e05fcb3 | Xavier Grangier | 09 May 2013, 13:26:46 UTC | move lxml stuff to parser class | 09 May 2013, 13:26:46 UTC |
7f9d52e | Xavier Grangier | 09 May 2013, 10:34:33 UTC | Add a drop_node method to Parser class | 09 May 2013, 10:34:33 UTC |
3e9f156 | Xavier Grangier | 09 May 2013, 09:41:58 UTC | pass config to cleaners init | 09 May 2013, 09:41:58 UTC |
e3fb693 | Xavier Grangier | 03 May 2013, 16:10:59 UTC | Merge pull request #17 from danielmagnussons/master Fixes windows image IOError and sv-stopwords | 03 May 2013, 16:10:59 UTC |
a27f001 | Daniel Magnusson | 03 May 2013, 15:49:32 UTC | write image in binary and sv stop words added ignored env/ dir | 03 May 2013, 15:49:32 UTC |
00cd0e6 | Robert Manson | 23 April 2013, 18:13:29 UTC | make url optional again in extract() | 23 April 2013, 18:13:29 UTC |
8d6eab9 | Robert Manson | 23 April 2013, 01:50:23 UTC | fix issue with canonical link in meta tag. When using `raw_html` in the extract method, it is possible to end up attempting to parse a None final_url in the article object if the raw_html document has a canonical link meta tag. | 23 April 2013, 01:50:23 UTC |
27834b2 | Xavier Grangier | 07 April 2013, 09:28:12 UTC | reenable tests | 07 April 2013, 09:28:12 UTC |
36a6090 | Xavier Grangier | 07 April 2013, 09:25:11 UTC | move cssselect to Parser class | 07 April 2013, 09:25:11 UTC |
f49a26a | Xavier Grangier | 07 April 2013, 09:23:02 UTC | move cssselect to Parser class | 07 April 2013, 09:23:02 UTC |
b06c6e4 | Xavier Grangier | 04 April 2013, 20:11:25 UTC | missing test file for b68e960 | 04 April 2013, 20:11:25 UTC |
5fc5c40 | Xavier Grangier | 04 April 2013, 20:10:40 UTC | don't replace www in domain if article has no domain | 04 April 2013, 20:10:40 UTC |
b68e960 | Xavier Grangier | 04 April 2013, 20:09:59 UTC | add chinese extractor tests | 04 April 2013, 20:09:59 UTC |
beda8fa | Xavier Grangier | 04 April 2013, 19:52:41 UTC | camelcase less finalUrl | 04 April 2013, 19:52:41 UTC |
456a63d | Xavier Grangier | 04 April 2013, 19:52:01 UTC | camelcase less UrlToCrawl | 04 April 2013, 19:52:01 UTC |
ee00ad2 | Xavier Grangier | 04 April 2013, 19:45:18 UTC | #14 - url kwargs is no more mandatory | 04 April 2013, 19:45:18 UTC |
3185d01 | Xavier Grangier | 02 April 2013, 20:57:08 UTC | updated todo list | 02 April 2013, 20:57:08 UTC |
2f3bd91 | Xavier Grangier | 02 April 2013, 20:53:15 UTC | add travis build status image | 02 April 2013, 20:53:15 UTC |
6c50a4e | Xavier Grangier | 02 April 2013, 20:52:02 UTC | add travis build status image | 02 April 2013, 20:52:02 UTC |
aefec86 | Xavier Grangier | 02 April 2013, 20:47:09 UTC | move tests files to tests/data directory | 02 April 2013, 20:47:09 UTC |
73c63c3 | Xavier Grangier | 02 April 2013, 20:32:33 UTC | test suite and travis yaml | 02 April 2013, 20:32:33 UTC |
e365e70 | Xavier Grangier | 02 April 2013, 17:49:01 UTC | cf #13 - Fixes multiplatform paths | 02 April 2013, 17:49:01 UTC |
2ac0154 | Xavier Grangier | 02 April 2013, 07:04:06 UTC | missing os import | 02 April 2013, 07:04:06 UTC |
44e3c45 | Xavier Grangier | 02 April 2013, 07:01:18 UTC | Add version file and bump to 1.0.0 due to API changes in camelcase less branch | 02 April 2013, 07:01:18 UTC |
19f7d0d | Xavier Grangier | 02 April 2013, 06:52:14 UTC | Misleading variable replacement | 02 April 2013, 06:52:14 UTC |
62b3e08 | Xavier Grangier | 02 April 2013, 06:44:41 UTC | Add a FIXME flag for windows file path | 02 April 2013, 06:44:41 UTC |
9a17da4 | Xavier Grangier | 02 April 2013, 06:42:39 UTC | Extractor classes camelcase less variables | 02 April 2013, 06:42:39 UTC |
c1a25c5 | Xavier Grangier | 27 March 2013, 07:57:46 UTC | bump to v0.2 | 27 March 2013, 07:57:46 UTC |
fa703b9 | Xavier Grangier | 27 March 2013, 07:55:17 UTC | Image extractor camelcase less method name | 27 March 2013, 07:55:17 UTC |
27da9a0 | Xavier Grangier | 27 March 2013, 07:47:48 UTC | Image Utils camelcase less | 27 March 2013, 07:47:48 UTC |
ba4a1ee | Xavier Grangier | 27 March 2013, 07:40:47 UTC | Image class camelcase less | 27 March 2013, 07:40:47 UTC |
8416015 | Xavier Grangier | 27 March 2013, 07:31:12 UTC | Text classes camelcase less | 27 March 2013, 07:31:12 UTC |
68665b5 | Xavier Grangier | 27 March 2013, 07:25:49 UTC | OutputFormatter camelcase less | 27 March 2013, 07:25:49 UTC |
01a2a83 | Xavier Grangier | 27 March 2013, 07:19:25 UTC | replace getHtml with get_html | 27 March 2013, 07:19:25 UTC |
3f5a825 | Xavier Grangier | 27 March 2013, 07:19:12 UTC | HtmlFetcher camelcaseless | 27 March 2013, 07:19:12 UTC |
5351ff8 | Xavier Grangier | 26 March 2013, 19:07:54 UTC | Extractor camelcase less variables | 26 March 2013, 19:07:54 UTC |
bc01d42 | Xavier Grangier | 26 March 2013, 18:51:14 UTC | ContentExtractor camelcase less method name | 26 March 2013, 18:51:14 UTC |
0b2a895 | Xavier Grangier | 26 March 2013, 18:38:41 UTC | fixme notice for os.path.join | 26 March 2013, 18:38:41 UTC |
3321003 | Xavier Grangier | 26 March 2013, 18:36:53 UTC | camelcase Crawler class | 26 March 2013, 18:36:53 UTC |
2cdf0a1 | Xavier Grangier | 26 March 2013, 08:00:41 UTC | Cleaner class camelless variables | 26 March 2013, 08:00:41 UTC |
f1762ef | Xavier Grangier | 26 March 2013, 07:53:27 UTC | Cleaner class camelless method names | 26 March 2013, 07:53:27 UTC |
e8f8bc8 | Xavier Grangier | 26 March 2013, 07:42:20 UTC | rawHTML is now raw_html | 26 March 2013, 07:42:20 UTC |
a16e153 | Xavier Grangier | 26 March 2013, 07:40:37 UTC | missing a camelcase args | 26 March 2013, 07:40:37 UTC |
0ee55d5 | Xavier Grangier | 26 March 2013, 07:39:31 UTC | camelcase less Goose class | 26 March 2013, 07:39:31 UTC |
14d381e | Xavier Grangier | 26 March 2013, 07:34:05 UTC | camelcase less Configuration class | 26 March 2013, 07:34:05 UTC |
6c3b46a | Xavier Grangier | 25 March 2013, 21:44:16 UTC | unwanted replacement | 25 March 2013, 21:44:16 UTC |
6f20be9 | Xavier Grangier | 25 March 2013, 18:57:06 UTC | camelcase less Article class | 25 March 2013, 18:57:06 UTC |
1e1ee05 | Xavier Grangier | 25 March 2013, 18:53:44 UTC | camelcase less Article class | 25 March 2013, 18:53:44 UTC |
431eb4a | Xavier Grangier | 25 March 2013, 11:48:33 UTC | rename Video.py to video.py | 25 March 2013, 11:48:33 UTC |
eaebbe5 | Xavier Grangier | 25 March 2013, 11:45:14 UTC | rename ImageUtils.py to utils.py | 25 March 2013, 11:45:14 UTC |
5f7d34a | Xavier Grangier | 25 March 2013, 08:01:58 UTC | mv LocallyStoredImage to image and image extractors to extractors.py | 25 March 2013, 08:01:58 UTC |
c6f6562 | Xavier Grangier | 25 March 2013, 07:51:18 UTC | rename ImageExtractor.py to extractors.py | 25 March 2013, 07:51:18 UTC |
5e57b1f | Xavier Grangier | 25 March 2013, 07:48:05 UTC | ImageDetails class now in image.py | 25 March 2013, 07:48:05 UTC |
a0609df | Xavier Grangier | 25 March 2013, 07:44:21 UTC | too much renaming | 25 March 2013, 07:44:21 UTC |
4b85429 | Xavier Grangier | 25 March 2013, 07:42:19 UTC | rename images/Image.py | 25 March 2013, 07:42:19 UTC |
4c29e2e | Xavier Grangier | 25 March 2013, 07:39:00 UTC | rename Crawler.py | 25 March 2013, 07:39:00 UTC |
6e9e21b | Xavier Grangier | 25 March 2013, 07:36:49 UTC | mv Goose.py content to __init__.py | 25 March 2013, 07:36:49 UTC |
01d8503 | Xavier Grangier | 25 March 2013, 07:33:05 UTC | rename Configuration.py | 25 March 2013, 07:33:05 UTC |
4b4f18e | Xavier Grangier | 25 March 2013, 07:29:22 UTC | rename Article.py | 25 March 2013, 07:29:22 UTC |
6e76dc3 | Xavier Grangier | 24 March 2013, 22:16:57 UTC | python 2.6 doesn't support assertIsInstance | 24 March 2013, 22:16:57 UTC |
48f03c5 | Xavier Grangier | 24 March 2013, 22:15:28 UTC | python 2.6 doesn't support assertIsNotNone | 24 March 2013, 22:15:28 UTC |
1b72703 | Xavier Grangier | 23 March 2013, 18:33:36 UTC | update THANKS file | 23 March 2013, 18:33:36 UTC |
fbd1e42 | Xavier Grangier | 23 March 2013, 18:31:46 UTC | update THANKS file | 23 March 2013, 18:31:46 UTC |
612aded | Xavier Grangier | 23 March 2013, 18:26:16 UTC | missing import | 23 March 2013, 18:26:16 UTC |
aa86d4b | Xavier Grangier | 23 March 2013, 18:24:58 UTC | cross platform filepath handling | 23 March 2013, 18:24:58 UTC |
374e00b | Xavier Grangier | 23 March 2013, 18:17:09 UTC | cross platform filepath handling | 23 March 2013, 18:17:09 UTC |
ca73bbf | Xavier Grangier | 22 March 2013, 19:05:26 UTC | Missing parentheses added in isOkToBoost | 22 March 2013, 19:05:26 UTC |
eeeb68b | Xavier Grangier | 22 March 2013, 19:02:18 UTC | Merge branch 'master' of github.com:xgdlm/python-goose | 22 March 2013, 19:02:18 UTC |
de5ce5a | Xavier Grangier | 22 March 2013, 19:01:56 UTC | add sup tag to replaceTagsWithText method | 22 March 2013, 19:01:56 UTC |
133fc3e | Xavier Grangier | 24 February 2013, 20:05:51 UTC | Merge pull request #10 from timjurka/master Removing set() Usage in Content Clusterer | 24 February 2013, 20:05:51 UTC |
690663d | Tim Jurka | 13 February 2013, 01:00:39 UTC | Don't want to use set() in content extractor, because DOM elements get reordered. | 13 February 2013, 01:00:39 UTC |
e36ff1e | Xavier Grangier | 10 December 2012, 16:20:53 UTC | Clean only current thread tmp files, add timestamp for multithreading | 10 December 2012, 16:20:53 UTC |
ba41090 | Xavier Grangier | 10 December 2012, 15:58:03 UTC | Update README.md | 10 December 2012, 15:58:03 UTC |
39384a5 | Xavier Grangier | 10 December 2012, 12:25:42 UTC | Missing jieba package | 10 December 2012, 12:25:42 UTC |
f1064ad | Xavier Grangier | 10 December 2012, 11:26:37 UTC | Adds Chinese stopwords analyser, and enable to pass a stopword analyser to config object | 10 December 2012, 11:26:37 UTC |
ff20ac2 | Xavier Grangier | 10 December 2012, 10:17:48 UTC | Add v0idnull to contributors | 10 December 2012, 10:17:48 UTC |
a932d96 | Xavier Grangier | 10 December 2012, 10:16:29 UTC | Merge pull request #8 from v0idnull/master Debug flag | 10 December 2012, 10:16:29 UTC |
ef18ab2 | v0idnull | 10 December 2012, 00:40:45 UTC | HTTP Debug mode now dependent on configuration | 10 December 2012, 00:40:45 UTC |
bb83da5 | v0idnull | 10 December 2012, 00:40:07 UTC | Added debug flag to configuration | 10 December 2012, 00:40:07 UTC |
1db85f9 | Xavier Grangier | 08 December 2012, 14:21:37 UTC | Merge branch 'master' of github.com:xgdlm/python-goose | 08 December 2012, 14:21:37 UTC |
f431d51 | Xavier Grangier | 08 December 2012, 14:20:36 UTC | release resources | 08 December 2012, 14:20:36 UTC |
d9bbb46 | Xavier Grangier | 01 November 2012, 10:18:04 UTC | Merge pull request #5 from dzen/master Fix missing dependency thanks | 01 November 2012, 10:18:04 UTC |
758243f | Benoit Calvez | 31 October 2012, 09:42:39 UTC | setup.py: Missing dependency on cssselect | 31 October 2012, 09:42:39 UTC |
94f0c62 | Xavier Grangier | 30 October 2012, 22:35:55 UTC | Merge branch 'master' of github.com:xgdlm/python-goose | 30 October 2012, 22:35:55 UTC |
68ca62b | Xavier Grangier | 30 October 2012, 22:33:52 UTC | hashlib.md5 doesn't support unicode, use str instead | 30 October 2012, 22:33:52 UTC |
e2a0962 | Xavier Grangier | 29 October 2012, 10:42:52 UTC | Typo thanks to brutasse | 29 October 2012, 10:42:52 UTC |
5e24b7d | Xavier Grangier | 28 October 2012, 12:08:48 UTC | Better configuration and usage instruction | 28 October 2012, 12:08:48 UTC |
dc3f762 | Xavier Grangier | 28 October 2012, 12:07:56 UTC | Missing commit for Language support | 28 October 2012, 12:07:56 UTC |
e28fff4 | Xavier Grangier | 28 October 2012, 11:50:23 UTC | Use the correct stopword file with regard to meta language. Stopwords are really important to check word and paragraph density; goose will now try to fetch the correct stop word file. It's also possible to force the target language using configuration | 28 October 2012, 11:50:23 UTC |
5ef43c5 | Xavier Grangier | 27 October 2012, 20:23:18 UTC | adds Parsely | 27 October 2012, 20:23:18 UTC |
e6cb957 | Xavier Grangier | 27 October 2012, 20:22:25 UTC | Cache the list of stop words (per language). This avoids re-reading all of the stop words from disk continuously. Thanks to Parsely | 27 October 2012, 20:22:25 UTC |
fbd1269 | Xavier Grangier | 27 October 2012, 20:18:27 UTC | Only create the trans table once and reuse it. | 27 October 2012, 20:18:27 UTC |
3cb0f4b | Xavier Grangier | 27 October 2012, 20:30:17 UTC | pep8 | 27 October 2012, 20:30:17 UTC |
7261316 | Xavier Grangier | 27 October 2012, 19:41:07 UTC | Make the precompiled PUNCTUATION regex actually reusable. | 27 October 2012, 19:41:07 UTC |
16de610 | Xavier Grangier | 27 October 2012, 19:32:43 UTC | Better MANIFEST.in | 27 October 2012, 19:32:43 UTC |
c60bb53 | Xavier Grangier | 27 October 2012, 19:23:49 UTC | Remove useless module | 27 October 2012, 19:23:49 UTC |
7d88d01 | Xavier Grangier | 27 October 2012, 19:15:52 UTC | Remove "facebook_broadcasting" junk Remove the "facebook-broadcasting" div which would lead to 'Click "Add to Timeline" to publish what you read to Facebook' becoming the article text. This issue was surfacing while looking at CBS Local affiliate site support. | 27 October 2012, 19:15:52 UTC |
e5e2869 | Xavier Grangier | 27 October 2012, 19:09:28 UTC | Update include better OS X installation instructions | 27 October 2012, 19:09:28 UTC |