https://github.com/google/sling

Revision Message Commit Date
2956799 Fix subsumed calculation (#345) * Fix subsumed calculation * Add date special case handling 20 March 2019, 21:51:44 UTC
080c05a Minor fixes to the Wikicat browser (#344) - Some cosmetic fixes - Fix a counting bug while generating recordio 19 March 2019, 14:30:02 UTC
a6a6775 Enhancements to the Wikicat browser (#343) - Hovering over fact-matching counts now brings up a list of qids that illustrate that count. - While browsing a signature, the user now has an option to generate Wikibot recordio files directly from the browser. - Modify the fact member so that it stores exemplars for each match type. Further, this list is exhaustive for NEW, ADDITIONAL, and SUBSUMED_BY_EXISTING. This is used in both the new features above. - Simplified the browser code a bit, and removed some unnecessary Javascript. 18 March 2019, 17:29:57 UTC
64c35cf Fix the problem with women (#341) 15 March 2019, 13:31:01 UTC
fce18b2 Infobox aliases and wiki links (#339) 15 March 2019, 11:49:51 UTC
b765579 Various Wikicat fixes (#340) - Omit outputting low-frequency (pid, qid) spans. - Move subsumption checking code to Python - We only get the closure from C++ code now - Subsumption code in Python checks for genre, subclass, part_of, parent_org, located_in, and date subsumption. More properties can be added easily. - Fix a minor bug in the browser. - Expose the Resolve() method in the Python API. 13 March 2019, 17:45:32 UTC
97746ce Token styles (#338) 13 March 2019, 13:13:27 UTC
05373f7 XML frame reader (#337) 11 March 2019, 09:41:31 UTC
cdacfc0 Browser for category parses (#336) Allows browsing by category, signature, or top signatures (denoted by 'top'). Supports coarse and fine signatures, and three metrics to sort the parses with. Allows customization of fact-matching scores. 26 February 2019, 22:02:44 UTC
7c53dec A few fixes to wikibot.py (#332) 25 February 2019, 10:13:42 UTC
b0388ed Check HTTP path after unescaping (#335) 25 February 2019, 10:00:37 UTC
1c09339 Fact-matching statistics for category parsing (#331) * FactMatcher that computes how proposed facts in a parse match with existing facts for the same property. They are classified as new, or matching existing facts (either exactly or via subsumption), or conflicting (e.g. for unique-valued properties) or additional facts. * A workflow task that attaches this information to each parse. Other changes: * Add methods to FactExtractor for: (a) only reporting facts for specified properties. (b) report facts with or without backoff (aka closure). Previously the only supported mode was with backoff. I have confirmed that this change doesn't affect the runtime of the backoff mode. (c) Add a method to check if one value subsumes another. These methods also come with the corresponding Python API methods. * Skip empty parses in the parse generator. * A performance improvement: we replace a sling.Array with a python list, allowing the sling.Store behind the array to be garbage collected. * Store the list of category members in the category frame. These only cover legit members, e.g. they exclude subcategories. * Add 'type of sport' and 'cause of death' to the custom taxonomy. * Replace prior and member_score values with their geometric means. This makes it fair to compare a parse with many spans vs a parse with only a few spans (since the priors and member_scores are multiplicative across spans). * Also attach the coarse signature to each parse. * Allow load_kb() to also take filename arguments, and make it use a global pool of loaded KBs. This way multiple tasks can share a KB, which saves both memory and runtime. * Add a 'skip_generation' flag to the workflow, so we have an option to not run (and instead use the cached output of) the expensive candidate parse generation stage. 21 February 2019, 04:13:35 UTC
87b8666 Handle Unicode strings in Python frame API (#330) 06 February 2019, 18:34:19 UTC
797ec43 Fix benign error in KB UI. (#329) When deleting characters from the search bar the update handler gets a null item. Check it before dereferencing `.ref` on it. For reference, the console error: TypeError: Cannot read property 'ref' of undefined at Object.self.selectedItemChange (kb.js:46) at fn (eval at compile (angular.js:14605), <anonymous>:4:318) at m.d.(:8080/kb/anonymous function) [as itemChange] (http://ajax.googleapis.com/ajax/libs/angularjs/1.5.7/angular.min.js:83:232) at D (angular-material.min.js:13) at N (angular-material.min.js:13) at m.$digest (angular.js:17286) at b.$apply (angular.js:17552) at Pg.$$debounceViewValueCommit (angular.js:27516) at Pg.$setViewValue (angular.js:27488) at HTMLInputElement.l (angular.js:23730) 01 February 2019, 13:27:17 UTC
8f0d22d Better extraction and upload of birth and death dates. (#328) 28 January 2019, 10:06:26 UTC
7ea2aa6 [cpu] Replace _xgetbv identifier with xgetbv (#327) 22 January 2019, 21:39:05 UTC
93c2ab1 Alias transfer (#326) 22 January 2019, 09:35:43 UTC
e037430 Update README.md 14 January 2019, 22:53:18 UTC
051b8ad Fix NotShiftOrMarkDelegate to work even if there is no MARK action (#324) When the training corpora only consists of single-token spans, MARK is not added to the action table. Therefore actions.mark() will return None and break NotShiftOrMarkDelegate. This PR fixes the delegate's behavior in this corner case. 14 January 2019, 21:50:13 UTC
18f0b22 Wikicat parse generator, filter, and signature builder. (#320) * Initial version of the parse generator + ranker + signature producer. - We get about 3.5M filtered parses across 520K acceptable English category titles. * Bug fix in Task::GetInputs() 04 January 2019, 22:10:50 UTC
60643d6 Template expansion for wikitext (#322) 02 January 2019, 17:08:55 UTC
c335393 Data files for template extraction (#321) 02 January 2019, 15:00:55 UTC
0e20b4d Refining birth and death date extraction and updating (#319) 21 December 2018, 14:42:59 UTC
4def25e Fix ; and : tokens to not be condeos (#318) 19 December 2018, 13:13:35 UTC
4ec8a7f Extracting birth death dates from English Wikipedia articles (#317) 19 December 2018, 12:35:16 UTC
63130a1 Adding 'figure dash' and 'minus sign' tokens and a few abbreviations (#316) * Adding 'figure dash' and 'minus sign' tokens and a few abbreviation 17 December 2018, 15:48:48 UTC
c63640d Wikiflow uses latest dump files. Improved wikibot handling of date precision. (#315) * Wikiflow downloads latest dump files by default. * Improved handling of date precision in wikibot. 17 December 2018, 10:11:16 UTC
3a1ea68 Refactor wiki text parser (#314) 14 December 2018, 13:32:42 UTC
f9163e9 Treat underscore as letter in tokenization (#313) 13 December 2018, 10:38:52 UTC
66fb7fb Matrix-matrix multiplication for integers (#312) 12 December 2018, 09:41:26 UTC
6f0acf0 Split kernel (#311) 10 December 2018, 16:44:41 UTC
7483c02 Handle Wikidata sense ids in converter (#310) 10 December 2018, 11:57:05 UTC
7b0690e Prepare for google3 import (#308) 07 December 2018, 14:15:25 UTC
cd25f83 Fix bug in taxonomy (#307) 06 December 2018, 16:07:41 UTC
d51bac6 Consolidate reduction ops (#306) 06 December 2018, 11:15:11 UTC
d5fe5bd Myelin test suite and bug fixes (#305) 05 December 2018, 20:44:54 UTC
666befd Use Python API for frame evaluation (#304) Bonus: Since the Python API doesn't need to read the common store from a file, we can get rid of commons_path at various places. 04 December 2018, 18:35:29 UTC
523f73f Python API for Myelin (#302) 04 December 2018, 12:18:06 UTC
756bebf Python iterator for RecordDatabase (#303) 03 December 2018, 08:59:01 UTC
db39e09 Myelin fixes to make the new parser run on GPU (#299) 30 November 2018, 13:30:05 UTC
9d40468 Fix bug in Fingerprint function and other fixes (#300) 28 November 2018, 15:17:01 UTC
f4e51c8 Update code to match the updated sling calendar/date interface. (#298) 26 November 2018, 14:02:13 UTC
811e1b7 Fix date values to be compatible with stored format (#297) 26 November 2018, 10:24:13 UTC
debee6c Adding note about building on WSL. (#295) 21 November 2018, 14:16:24 UTC
474f4e5 Adding ability to extract inception dates (#294) 21 November 2018, 09:49:24 UTC
8216ee3 Alias case forms and taxonomies (#292) 21 November 2018, 09:37:16 UTC
14c5c07 Add back myelin tf extractor (#293) 16 November 2018, 14:51:12 UTC
b8b1234 Bug fixes for document viewer for documents with WikiData items (#290) 14 November 2018, 15:25:07 UTC
938c01d Update documentation for v2 (#289) 14 November 2018, 11:41:18 UTC
88713e6 Update README.md 14 November 2018, 10:08:15 UTC
7b5db94 Script for downloading and converting OntoNotes (#287) 13 November 2018, 09:57:19 UTC
cad5b52 Mark/Evoke transition system and lots of other goodies (#286) 1. Modify the transition system by adding a MARK action to denote the beginning of a multi-token span. A MARK is paired with a subsequent EVOKE(type=t) action that evokes a span from the marked token to the current token. Marked tokens are pushed on a stack, which automatically gives a nested-semantics of most-nested-span-first. Meanwhile single token spans continue to be generated via EVOKE(length=1, type=t) actions, and the REFER action is unchanged. MARK actions are accompanied by four new features: binned distance from the current token to the top of the mark stack, the activation when the topmost marked token was pushed, the LR and RL LSTM vectors of the token where the MARK was output. Splitting multi-token EVOKEs into MARK+EVOKE reduces the learning burden, and also gets rid of various EVOKE(len=L, type=T) actions where L>1. This PR also increases the history feature's size from 4 to 5 to account for the general increase in transition sequence lengths. Overall this boosts SPAN F1 from 92.5 -> 92.8, ROLE F1 from 72.2 -> 72.5, SLOT F1 from 82.8 -> 83.2. 2. This PR also simplifies the ParserState class by using Document's span-indexing code to judge when a span can be legally evoked. Doing this allows us to (a) get rid of the Nesting structure, (b) directly deal with frame handles instead of going via indices and integer handles, (c) get rid of the 'frames' array and solely deal with the attention buffer, (d) get rid of AddParseToDocument(), since frames/roles are eagerly added to the document as soon as they are generated. 3. Support for reading the flow file in the Python setup. This allows us to read the flow and perform inference. This PR adds a parse.py tool parallel to the parse.cc tool. Reading the flow file requires us to only read some simple fields for the Spec class, and all the big ticket items (e.g. commons, actions, cascade) are read off the commons blob. 4. Support for tracing. This PR adds the tracing information as a frame to the document, in both Python and C++. It is implemented as pay-as-you-go and when disabled, it doesn't cause any performance penalties (i.e. no timing differences). The tracing information has already proved invaluable whilst ensuring parity of the MARK/EVOKE C++ runtime vs Python. The tracing file in Python (trace.py) can also be used as a standalone script that compares two recordios of documents with tracing information, and checks for equality of traces. Right now it checks for lstm feature indices, ff features, and predicted & final actions. This was very useful in tracking down a few corner cases where the lexical features were different in Python vs C++. Minor changes: - The transition generator has also been simplified a bit, given that parts of it had to be rewritten to accommodate the new MARK/EVOKE mechanism. - We can now supply an output recordio file in parse.cc - Remove the 'small' flag in Spec, since we can use the tracing utility to debug models now. - Fix two lexical feature disparities: (a) The C++ runtime was marking backquotes ("`") and carets ("^") as punctuation because CHARCAT_MODIFIER_SYMBOL was in the punctuation mask. I have removed it and now it behaves the same as in Python. (b) Fixed a missing encode('utf-8') call when I was generating suffixes that were unicode characters (there were only two such cases in the dev corpus). - Print final and best checkpoint metrics in viewmodel.py 12 November 2018, 21:06:22 UTC
0ae60bf Feature tracing for lexical encoder (#288) 12 November 2018, 12:55:42 UTC
a55dec7 Change default parameter initialization strategy (#284) Initialize matrices with orthonormal initialization, softmax heads with Xavier normalization. This together with shuffling the input gives a pretty significant gain across the board. SPAN F1 91.6 -> 92.5 FRAME F1 92.9 -> 93.7 ROLE F1 71.6 -> 72.0 TYPE F1 87.5 -> 89.6 SLOT F1 81.3 -> 82.7 24 October 2018, 22:19:05 UTC
0012fa6 Platform-dependent compilation of pysling (#285) 24 October 2018, 13:47:22 UTC
a53b317 More heuristics to generate noun phrases in the Ontonotes converter (#281) The updated set of heuristics is now: Marks [NML HYPH PP] spans as noun phrases, e.g. [Commander - in - chief], or [right - of - way]. The inner NML is not processed as a nested span. This yields 105 noun phrases. Each base NML span (i.e. only has token children) is split into noun phrases, such that each phrase: doesn't overlap with an existing NER/PRED/noun-phrase span, AND doesn't contain any conjunction (CC) token, AND doesn't begin or end in a HYPH token This yields ~7800 spans. Each base NP (i.e. with no NP descendants) is split into noun phrase(s) using the same heuristic as above for NML spans, with an additional restriction that each phrase token have a noun/hyphen part-of-speech tag. This yields ~178K spans. Each recursive NP of the form [NP ending in POS, token constituents] yields noun phrase(s) from the token constituents portion. For example, the NP "Japan's economy and development" would yield 'economy' and 'development' phrases. We need this where the token constituents are not covered by base NP(s). This yields ~5K spans. Marks each pronoun as a noun-phrase (of type PERSON for some pronouns). We now have a total of 664K spans (up from ~630K earlier). 22 October 2018, 17:30:43 UTC
25becca Simplify spec creation, and a few other cleanups. (#280) Remove Resources class. Remove the need to write the commons store twice. Write it directly to the output folder now. Few other touch-ups. A couple more validation checks. 19 October 2018, 21:48:23 UTC
d1bf1db Fix token words in corpus browser etc. (#279) 12 October 2018, 21:03:10 UTC
6adc38d Further tweaks to the Ontonotes Converter (#278) Existing NP Expansion heuristic has been removed. Add extra noun phrases using base NP and NML constituents. This adds 173K and 3.5K spans respectively. Subsequent spans (e.g. arguments, coref) are normalized to these spans now, if possible. Trim trailing possessives (e.g. China's => China). This affects only ~700 spans. Particle inclusion only done for verb predicates now. Docid is stored as /ontonotes/docid instead of the frame id. This will allow 2 docs with the same docid to be read in the same store (e.g. gold and test doc). More head-finding rules for NML. Span generator method, which cleans up and simplifies span iteration code. Predicates are now /pb/ instead of /pb/pred/. Arguments are now /pb/ instead of just . With all these changes, there are a total of 630K output spans now (up from 580K before this PR). The change is only +50K because we are able to align a lot more argument spans now. Known minor issue: 724 argument spans still lie inside an NER span. All of them don't share the head with the NER. In 434 of these, the predicate of the argument is also inside the NER. This is because annotators have annotated inside NER spans as well. We have kept these 724 spans for now. 12 October 2018, 18:25:38 UTC
e18afaf Remove unused schemas (#277) 12 October 2018, 09:30:24 UTC
bd12818 Corpus browser (#276) 10 October 2018, 19:13:07 UTC
5af0dbf External third-party web components (#275) 10 October 2018, 12:54:26 UTC
3f6f7c0 Renaming dashboard and updating fact extraction (#274) 09 October 2018, 14:40:32 UTC
74989ce Tweaks to the Ontonotes v5 converter (#271) Implement reduction to span head, and NP expansion heuristics. Option for skipping conjunctions during normalization (disabled by default) Option for using last token as span head if head computation failed (enabled by default) Output span nesting histogram and crossing span counts Output spans not matching any constituents Overall script for running the conversion, which also creates the commons and drops invalid sentences. With this PR, the normalization strategy is now: Drop leading articles, if enabled. Follow prepositions to objects, if enabled. For every SRL Arg and Coref span s: - Skip s if s is a conjunction and conjunctions are to be skipped. - [shrink_using_heads]: If s has the same head as a previously normalized span s' and s' is shorter than s, then s = s' - [reduce_to_head]: Otherwise s = head(s) - [np_expansion]: If s was shrunk to its head above, expand it to cover the full noun phrase (if one exists). Expand predicates to include particles, if enabled. With these changes we have the following output statistics: 99.9% output spans are length <= 6 (in contrast to only 89% original spans having length <= 6). 99% spans are length <= 3 Because of the reduction to head and careful NP expansion, only 0.7% spans are nested (that too at a max nesting depth = 1). Only 7 spans are crossing spans (these are automatically removed by the conversion shell script). We had 660K input unnormalized spans, of which 80K were shrunk to smaller existing spans, 346K were reduced to head token, and 7K were reduced to head and then NP-expanded. We ended up with 580K spans at the end. 08 October 2018, 19:04:51 UTC
d90f485 Update wiki dump file dates (#273) 08 October 2018, 11:20:18 UTC
6fddd7b Wikidata form ids (#272) 08 October 2018, 11:09:24 UTC
6ad54c6 Edge-triggered epoll in HTTP server (#270) 08 October 2018, 08:18:22 UTC
a9a9ab9 Add ability to process Wikibot logs (#267) 03 October 2018, 14:34:16 UTC
047f62c Ontonotes v5 Converter (#264) The converter handles all annotations in the Ontonotes v5 corpus, performs various span normalizations and writes the output as SLING documents, along with a detailed summary of the input, normalization, and output. Supported span normalization heuristics: Drop leading articles. Descend from prepositions to their objects. Shrinking a span to another if they share the same head. Particle inclusion in SRL predicates. Each normalization can be turned on/off on the commandline. Other commandline flags include: Importing coreference annotations or not (default=not) Output per sentence documents (default=yes) This PR also extends some util scripts, and makes a few fixes after the symbol-shortening change. 28 September 2018, 17:14:50 UTC
7bb0d74 Fix error in computing token word for Python Document (#268) 28 September 2018, 15:08:13 UTC
eb7904a Update wikiflow.md with a tip to set TMPDIR (#266) 28 September 2018, 14:44:36 UTC
05f9544 Adding bot for storing fact records in wikidata (wikibot.py). (#260) * Adding bot for storing fact records in wikidata (wikibot.py). 21 September 2018, 15:14:30 UTC
190b3ad Fix bug in Wikipedia alias qid etc. (#263) 21 September 2018, 14:09:32 UTC
bb5a406 Compact SLING documents (#259) 20 September 2018, 07:26:02 UTC
758ff08 Replace Collect and Lookup kernels with Gather (#262) Bonus: fix a corner case bug in ParserState, where we were erroneously disallowing any non-STOP action at the end of the sequence. This meant that any CONNECT actions at the very end would be wrongly disallowed. 19 September 2018, 23:37:58 UTC
ebe1d2e Use compare by value semantics for anonymous frames in pyapi (#261) 19 September 2018, 14:38:54 UTC
3026c2d Word and fact embeddings (#256) 17 September 2018, 12:25:56 UTC
50be27f Prevent crash in parser profiling (#258) 17 September 2018, 08:30:04 UTC
2d8c1b7 Fix clang warnings (#257) 12 September 2018, 13:11:38 UTC
6ade83a PyArray slices (#254) 11 September 2018, 10:40:41 UTC
3d1b365 Python API for fact extractor (#253) 11 September 2018, 09:19:58 UTC
a75b6ea Knowledge base snapshot (#252) 11 September 2018, 09:18:59 UTC
0de4b91 Adding wiki birth date extraction code (#247) * added wiki birth date extraction code * addressing comments 08 September 2018, 06:16:50 UTC
434c545 Dynamic HTTP worker pool expansion (#251) 07 September 2018, 18:50:20 UTC
2eaa34c Ahead-of-time Myelin compiler (#250) 07 September 2018, 16:32:19 UTC
d81d6f8 Frame store snapshots (#248) 07 September 2018, 14:41:00 UTC
645837d Category graph inversion (#249) 07 September 2018, 14:00:39 UTC
b699abb Memory-mapped repository files (#244) 07 September 2018, 08:42:43 UTC
2511361 Add optional oov input to Gather (#245) 07 September 2018, 08:41:21 UTC
e99ba0f ISO 8601 dates (#246) 06 September 2018, 09:53:22 UTC
870bede Calendar bug fixes (#243) 03 September 2018, 12:03:10 UTC
0f00f33 Flow reader, and a utility to print training details saved in a flow file. (#241) 30 August 2018, 14:29:20 UTC
ec6b156 Record file index (#242) 30 August 2018, 12:37:30 UTC
c98b210 Fix bug in PyFrame::Extend (#240) 29 August 2018, 17:00:48 UTC
f2c1dd5 Knowledge base fact extractor (#239) 28 August 2018, 08:05:24 UTC
e0882af Parse Wikipedia category pages (#238) 28 August 2018, 08:04:52 UTC
139a1f8 Task for running multi-threaded learning (#235) 24 August 2018, 11:53:38 UTC
01e59fb Bloom filter, Top and SortMap utility classes (#236) 24 August 2018, 09:51:10 UTC
f6bc444 Python API for Wikidata converter (#234) 23 August 2018, 08:35:59 UTC
0240136 Update frames.md 23 August 2018, 08:09:21 UTC
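
Note on 1c09339: the message says prior and member_score values were replaced with their geometric means so that parses with many spans can be compared fairly against parses with few spans. The sketch below only illustrates that arithmetic. It is not taken from the SLING code base; the function name `geometric_mean` and the sample scores are hypothetical.

```python
import math


def geometric_mean(scores):
    """Geometric mean of positive per-span scores, computed in log space."""
    if not scores:
        raise ValueError("need at least one score")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical per-span priors for two parses (illustrative values only).
short_parse = [0.5, 0.5]
long_parse = [0.5, 0.5, 0.5, 0.5]

# A raw product shrinks with every extra span, so it penalizes longer parses.
raw_short = math.prod(short_parse)
raw_long = math.prod(long_parse)

# The geometric mean normalizes by span count, so both parses score alike.
gm_short = geometric_mean(short_parse)
gm_long = geometric_mean(long_parse)
```

With equal per-span quality, the raw products differ while the geometric means agree, which matches the fairness argument given in the commit message.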