https://github.com/grangier/python-goose

Revision Message Commit Date
10e792a get parser class from config 09 May 2013, 14:41:58 UTC
b03cf04 move lxml stuff to parser class 09 May 2013, 13:34:41 UTC
e05fcb3 move lxml stuff to parser class 09 May 2013, 13:26:46 UTC
7f9d52e Add a drop_node method to Parser class 09 May 2013, 10:34:33 UTC
3e9f156 pass config to cleaners init 09 May 2013, 09:41:58 UTC
e3fb693 Merge pull request #17 from danielmagnussons/master Fixes windows image IOError and sv-stopwords 03 May 2013, 16:10:59 UTC
a27f001 write image in binary and sv stop words added ignored env/ dir 03 May 2013, 15:49:32 UTC
00cd0e6 make url optional again in extract() 23 April 2013, 18:13:29 UTC
8d6eab9 fix issue with canonical link in meta tag: when using `raw_html` in the extract method it is possible to end up attempting to parse a None final_url in the article object if the raw_html document has a canonical link meta tag. 23 April 2013, 01:50:23 UTC
27834b2 reenable tests 07 April 2013, 09:28:12 UTC
36a6090 move cssselect to Parser class 07 April 2013, 09:25:11 UTC
f49a26a move cssselect to Parser class 07 April 2013, 09:23:02 UTC
b06c6e4 missing test file for b68e960 04 April 2013, 20:11:25 UTC
5fc5c40 don't replace www in domain if article has no domain 04 April 2013, 20:10:40 UTC
b68e960 add chinese extractor tests 04 April 2013, 20:09:59 UTC
beda8fa camelcase less finalUrl 04 April 2013, 19:52:41 UTC
456a63d camelcase less UrlToCrawl 04 April 2013, 19:52:01 UTC
ee00ad2 #14 - url kwargs is no more mandatory 04 April 2013, 19:45:18 UTC
3185d01 updated todo list 02 April 2013, 20:57:08 UTC
2f3bd91 add travis build status image 02 April 2013, 20:53:15 UTC
6c50a4e add travis build status image 02 April 2013, 20:52:02 UTC
aefec86 move tests files to tests/data directory 02 April 2013, 20:47:09 UTC
73c63c3 test suite and travis yaml 02 April 2013, 20:32:33 UTC
e365e70 cf #13 - Fixes multiplatform paths 02 April 2013, 17:49:01 UTC
2ac0154 missong os import 02 April 2013, 07:04:06 UTC
44e3c45 Add version file and bump to 1.0.0 due to API changes in camelcase less branch 02 April 2013, 07:01:18 UTC
19f7d0d Misleading variable replacement 02 April 2013, 06:52:14 UTC
62b3e08 Add a FIXEME flag for windows file path 02 April 2013, 06:44:41 UTC
9a17da4 Extractor classes camelcase less variables 02 April 2013, 06:42:39 UTC
c1a25c5 bump to v0.2 27 March 2013, 07:57:46 UTC
fa703b9 Image extractor camelcase less methode name 27 March 2013, 07:55:17 UTC
27da9a0 Image Utils camelcase less 27 March 2013, 07:47:48 UTC
ba4a1ee Image class camelcase less 27 March 2013, 07:40:47 UTC
8416015 Text classes camelcase less 27 March 2013, 07:31:12 UTC
68665b5 OutputFormatter camelcase less 27 March 2013, 07:25:49 UTC
01a2a83 replace getHtml with get_html 27 March 2013, 07:19:25 UTC
3f5a825 HtmlFetcher camelcaseless 27 March 2013, 07:19:12 UTC
5351ff8 Extractor camelcase less variables 26 March 2013, 19:07:54 UTC
bc01d42 ContentExtractor camelcase less methode name 26 March 2013, 18:51:14 UTC
0b2a895 fixme notice for os.path.join 26 March 2013, 18:38:41 UTC
3321003 camelcase Crawler class 26 March 2013, 18:36:53 UTC
2cdf0a1 Cleaner class camelless variables 26 March 2013, 08:00:41 UTC
f1762ef Cleaner class camelless methodes name 26 March 2013, 07:53:27 UTC
e8f8bc8 rawHTML is now raw_html 26 March 2013, 07:42:20 UTC
a16e153 missing a camelcase args 26 March 2013, 07:40:37 UTC
0ee55d5 camelcase less Goose class 26 March 2013, 07:39:31 UTC
14d381e camelcase less Configuration class 26 March 2013, 07:34:05 UTC
6c3b46a unwanted replacement 25 March 2013, 21:44:16 UTC
6f20be9 camelcase less Article class 25 March 2013, 18:57:06 UTC
1e1ee05 camelcase less Article class 25 March 2013, 18:53:44 UTC
431eb4a rename Video.py to video.py 25 March 2013, 11:48:33 UTC
eaebbe5 rename ImageUtils.py to utils.py 25 March 2013, 11:45:14 UTC
5f7d34a mv LocallyStoredImage to image and image extractors to extractors.py 25 March 2013, 08:01:58 UTC
c6f6562 rename ImageExtractor.py to extractors.py 25 March 2013, 07:51:18 UTC
5e57b1f ImageDetails class now in image.py 25 March 2013, 07:48:05 UTC
a0609df too much renaming 25 March 2013, 07:44:21 UTC
4b85429 rename images/Image.py 25 March 2013, 07:42:19 UTC
4c29e2e rename Crawler.py 25 March 2013, 07:39:00 UTC
6e9e21b mv Goose.py content to __init__.py 25 March 2013, 07:36:49 UTC
01d8503 rename Configuration.py 25 March 2013, 07:33:05 UTC
4b4f18e rename Article.py 25 March 2013, 07:29:22 UTC
6e76dc3 python 2.6 doesn't support assertIsInstance 24 March 2013, 22:16:57 UTC
48f03c5 python 2.6 doesn't support assertIsNotNone 24 March 2013, 22:15:28 UTC
1b72703 update THANKS file 23 March 2013, 18:33:36 UTC
fbd1e42 update THANKS file 23 March 2013, 18:31:46 UTC
612aded missing import 23 March 2013, 18:26:16 UTC
aa86d4b cross platform filepath handeling 23 March 2013, 18:24:58 UTC
374e00b cross platform filepath handeling 23 March 2013, 18:17:09 UTC
ca73bbf Missing parentheses added in isOkToBoost 22 March 2013, 19:05:26 UTC
eeeb68b Merge branch 'master' of github.com:xgdlm/python-goose 22 March 2013, 19:02:18 UTC
de5ce5a add sup tag to replaceTagsWithText methode 22 March 2013, 19:01:56 UTC
133fc3e Merge pull request #10 from timjurka/master Removing set() Usage in Content Clusterer 24 February 2013, 20:05:51 UTC
690663d Don't want to use set() in content extractor, because DOM elements get reordered. 13 February 2013, 01:00:39 UTC
e36ff1e Clean only current thread tmp files, add timestamp for multithreading 10 December 2012, 16:20:53 UTC
ba41090 Update README.md 10 December 2012, 15:58:03 UTC
39384a5 Missing jieba package 10 December 2012, 12:25:42 UTC
f1064ad Adds Chinese stopwords analyser, and enable to pass a stopword analyser to config object 10 December 2012, 11:26:37 UTC
ff20ac2 Add v0idnull to contributors 10 December 2012, 10:17:48 UTC
a932d96 Merge pull request #8 from v0idnull/master Debug flag 10 December 2012, 10:16:29 UTC
ef18ab2 HTTP Debug mode now dependent on configuration 10 December 2012, 00:40:45 UTC
bb83da5 Added debug flag to configuration 10 December 2012, 00:40:07 UTC
1db85f9 Merge branch 'master' of github.com:xgdlm/python-goose 08 December 2012, 14:21:37 UTC
f431d51 release resources 08 December 2012, 14:20:36 UTC
d9bbb46 Merge pull request #5 from dzen/master Fix missing dependancy thanks 01 November 2012, 10:18:04 UTC
758243f setup.py: Missing dependancy on cssselect 31 October 2012, 09:42:39 UTC
94f0c62 Merge branch 'master' of github.com:xgdlm/python-goose 30 October 2012, 22:35:55 UTC
68ca62b hashlib.md5 doesn't support unicode, use str instead 30 October 2012, 22:33:52 UTC
e2a0962 Typo thanks to brutasse 29 October 2012, 10:42:52 UTC
5e24b7d Better configuration and usage instruction 28 October 2012, 12:08:48 UTC
dc3f762 Missing commit for Language support 28 October 2012, 12:07:56 UTC
e28fff4 Use the correct stopword file in regard of meta language: stopswords are really important to check words and paragraphe density; goose will now try to fetch the correct stop word file. It's also possible to force the target language using configuration 28 October 2012, 11:50:23 UTC
5ef43c5 adds Parsely 27 October 2012, 20:23:18 UTC
e6cb957 Cache the list of stop words (per language). This avoids re-reading all of the stop words from disk continuously. Tanks to Parsely 27 October 2012, 20:22:25 UTC
fbd1269 Only create the trans table once and reuse it. 27 October 2012, 20:18:27 UTC
3cb0f4b pep8 27 October 2012, 20:30:17 UTC
7261316 Make the precompiled PUNCTUATION regex actually reusable. 27 October 2012, 19:41:07 UTC
16de610 Better MANIFEST.in 27 October 2012, 19:32:43 UTC
c60bb53 Remove useless module 27 October 2012, 19:23:49 UTC
7d88d01 Remove "facebook_broadcasting" junk: Remove the "facebook-broadcasting" div which would lead to 'Click "Add to Timeline" to publish what you read to Facebook' becoming the article text. This issue was surfacing while looking at CBS Local affiliate site support. 27 October 2012, 19:15:52 UTC
e5e2869 Update include better OS X installation instructions 27 October 2012, 19:09:28 UTC