{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text-Fabric\n", "\n", "This is a testing ground for algorithms to be used in Text-Fabric." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Range manipulation\n", "\n", "### Convert iterable to ranges\n", "Convert an iterable of numbers to an optimal list of ranges, which can be rendered as a comma-separated string.\n", "Optimal means:\n", "1. the ranges are sorted\n", "2. adjacent ranges do not share a boundary, i.e. there is a real gap\n", "   between adjacent ranges\n", "\n", "**NB:** Unlike in Python, ranges include their boundary values.\n", "\n", "The parameter `nlist` can be any iterable (arrays, lists,\n", "dictionary keys, sets), but the items must be numbers, not strings.\n", "The iterable will be sorted and stripped of duplicates.\n", "\n", "We also define the converse: from a list of number ranges to an iterable, in this case a list. The `ranges` parameter is a list of tuples `(start, end)`, which are the boundaries of a range. The boundaries are inclusive. Instead of a tuple, a single integer may be provided.\n", "If `start` is bigger than `end`, the two will be swapped around.\n", "The result will be a sorted list without duplicates." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import collections" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def convert_to_ranges(nlist):\n", "    # Collect maximal runs of consecutive numbers as inclusive (start, end) tuples.\n", "    ranges = []\n", "    curstart = None\n", "    curend = None\n", "    for n in sorted(set(nlist)):\n", "        if curstart is None:\n", "            curstart = n\n", "            curend = n\n", "        elif n == curend + 1:\n", "            # n extends the current run\n", "            curend = n\n", "        else:\n", "            # there is a gap: close the current run and start a new one\n", "            ranges.append((curstart, curend))\n", "            curstart = n\n", "            curend = n\n", "    if curstart is not None:\n", "        ranges.append((curstart, curend))\n", "    return ranges" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def ranges_from_text(ranges_text):\n", "    # Parse text such as '1-3,7' into [(1, 3), 7].\n", "    result = []\n", "    for r_str in ranges_text.split(','):\n", "        bounds = r_str.split('-')\n", "        if len(bounds) == 1:\n", "            result.append(int(bounds[0]))\n", "        else:\n", "            result.append((int(bounds[0]), int(bounds[1])))\n", "    return result\n", "\n", "def ranges_to_list(ranges):\n", "    # Expand tuples and single integers into a sorted list without duplicates.\n", "    covered = set()\n", "    for r in ranges:\n", "        if type(r) is tuple:\n", "            (start, end) = r\n", "            if start > end:\n", "                # boundaries are in the wrong order: swap them\n", "                (end, start) = r\n", "            for i in range(start, end+1):\n", "                covered.add(i)\n", "        else:\n", "            covered.add(r)\n", "    return sorted(covered)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def ranges_rep(ranges):\n", "    # Render ranges as text: single values as 'n', proper ranges as 'start-end'.\n", "    return ','.join(\n", "        str(r) if type(r) is int else\\\n", "        str(r[0]) if len(r) == 1 or r[0]==r[1] else\\\n", "        '{}-{}'.format(*r)\\\n", "        for r in ranges\n", "    )" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tests_r2l = dict(\n", "    a_empty = [],\n", "    b_simple = [(5,10)],\n", "    c_swapped = [(10,5)],\n", "    d_single = [1,2,3,4,5],\n", "    e_mixed = [1,2, (10,15), 16],\n", "    f_duplicate = [1, (1,5), (3,7), 7],\n", "    g_order = [7, (1,5), 
(7,3), 1],\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[a_empty     ] []\n", "[b_simple    ] [5, 6, 7, 8, 9, 10]\n", "[c_swapped   ] [5, 6, 7, 8, 9, 10]\n", "[d_single    ] [1, 2, 3, 4, 5]\n", "[e_mixed     ] [1, 2, 10, 11, 12, 13, 14, 15, 16]\n", "[f_duplicate ] [1, 2, 3, 4, 5, 6, 7]\n", "[g_order     ] [1, 2, 3, 4, 5, 6, 7]\n" ] } ], "source": [ "for (t, data) in sorted(tests_r2l.items()):\n", "    print('[{:<12}] {}'.format(t, ranges_to_list(data)))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tests_l2r = dict(\n", "    a_empty = [],\n", "    b_consec = range(5,10),\n", "    c_mixed = ranges_to_list([(1,4), (5,8), (10,12), (100, 110)]),\n", "    d_mixed_up = [1,2, 5,6,7, 2,3, 3,4, 12,11,10],\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[a_empty     ] []\n", "[b_consec    ] [(5, 9)]\n", "[c_mixed     ] [(1, 8), (10, 12), (100, 110)]\n", "[d_mixed_up  ] [(1, 7), (10, 12)]\n" ] } ], "source": [ "for (t, data) in sorted(tests_l2r.items()):\n", "    print('[{:<12}] {}'.format(t, convert_to_ranges(data)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exporting otype\n", "We want to export the `otype` feature in compact form, like this:\n", "\n", "    0-425000 word\n", "    500-700000 phrase\n", "\n", "and so on for each object type.\n", "\n", "We do the exercise for the brand-new data version ETCBC4c." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s This is LAF-Fabric 4.8.3\n", "API reference: http://laf-fabric.readthedocs.org/en/latest/texts/API-reference.html\n", "Feature doc: https://shebanq.ancient-data.org/static/docs/featuredoc/texts/welcome.html\n", "\n" ] } ], "source": [ "from laf.fabric import LafFabric\n", "from etcbc.preprocess import prep\n", "fabric = LafFabric()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 0.00s LOADING API: please wait ... \n", " 0.00s DETAIL: COMPILING m: etcbc4c: UP TO DATE\n", " 0.00s USING main: etcbc4c DATA COMPILED AT: 2016-11-09T19-16-37\n", " 0.01s DETAIL: load main: G.node_anchor_min\n", " 0.10s DETAIL: load main: G.node_anchor_max\n", " 0.14s DETAIL: load main: G.node_sort\n", " 0.19s DETAIL: load main: G.node_sort_inv\n", " 0.56s DETAIL: load main: G.edges_from\n", " 0.62s DETAIL: load main: G.edges_to\n", " 0.68s DETAIL: load main: F.etcbc4_db_monads [node] \n", " 1.34s DETAIL: load main: F.etcbc4_db_otype [node] \n", " 1.96s DETAIL: load main: F.etcbc4_ft_g_word_utf8 [node] \n", " 2.24s DETAIL: load main: F.etcbc4_ft_sp [node] \n", " 2.40s DETAIL: load main: F.etcbc4_ft_trailer_utf8 [node] \n", " 2.51s DETAIL: load main: F.etcbc4_ft_functional_parent [e] \n", " 2.75s DETAIL: load main: F.etcbc4_ft_mother [e] \n", " 2.81s DETAIL: load main: C.etcbc4_ft_functional_parent -> \n", " 3.52s DETAIL: load main: C.etcbc4_ft_mother -> \n", " 3.66s DETAIL: load main: C.etcbc4_ft_functional_parent <- \n", " 4.19s DETAIL: load main: C.etcbc4_ft_mother <- \n", " 4.38s LOGFILE=/Users/dirk/laf/laf-fabric-output/etcbc4c/TF/__log__TF.txt\n", " 4.38s INFO: LOADING PREPARED data: please wait ... 
\n", " 4.39s prep prep: G.node_sort\n", " 4.45s prep prep: G.node_sort_inv\n", " 4.97s prep prep: L.node_up\n", " 7.81s prep prep: L.node_down\n", " 14s prep prep: V.verses\n", " 14s prep prep: V.books_la\n", " 14s ETCBC reference: http://laf-fabric.readthedocs.org/en/latest/texts/ETCBC-reference.html\n", " 15s INFO: LOADED PREPARED data\n", " 15s INFO: DATA LOADED FROM SOURCE etcbc4c AND ANNOX FOR TASK TF AT 2016-11-25T13-08-40\n" ] } ], "source": [ "fabric.load('etcbc4c', '--', 'TF', {\n", " \"xmlids\": {\"node\": False, \"edge\": False},\n", " \"features\": ('''\n", " otype monads g_word_utf8 trailer_utf8\n", " sp\n", " ''','''\n", " mother\n", " functional_parent\n", " '''),\n", " \"prepare\": prep(select={'L'}),\n", " \"primary\": False,\n", "}, verbose='DETAIL')\n", "exec(fabric.localnames.format(var='fabric'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## otype" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "otypes = '''\n", " word\n", " subphrase\n", " phrase_atom\n", " phrase\n", " clause_atom\n", " clause\n", " sentence_atom\n", " sentence\n", " half_verse\n", " verse\n", " chapter\n", " book\n", "'''.strip().split()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 27s word 1 ranges\n", " 28s subphrase 1 ranges\n", " 29s phrase_atom 1 ranges\n", " 30s phrase 1 ranges\n", " 31s clause_atom 1 ranges\n", " 32s clause 1 ranges\n", " 33s sentence_atom 1 ranges\n", " 34s sentence 1 ranges\n", " 35s half_verse 1 ranges\n", " 36s verse 1 ranges\n", " 38s chapter 1 ranges\n", " 39s book 1 ranges\n", "word from 0 to 426580\n", "clause from 426581 to 514580\n", "clause_atom from 514581 to 605142\n", "phrase from 605143 to 858316\n", "phrase_atom from 858317 to 1125831\n", "sentence from 1125832 to 1189401\n", "sentence_atom from 1189402 to 1253740\n", "subphrase from 1253741 
to 1367532\n", "book            from 1367533 to 1367571\n", "chapter         from 1367572 to 1368500\n", "half_verse      from 1368501 to 1413680\n", "verse           from 1413681 to 1436893\n" ] } ], "source": [ "ranges = {}\n", "for otype in otypes:\n", "    these_ranges = convert_to_ranges(F.otype.s(otype))\n", "    inf('{:<15} {:>5} ranges'.format(\n", "        otype, len(these_ranges),\n", "    ), withtime=True)\n", "    ranges[otype] = these_ranges\n", "# every object type occupies exactly one range, so we can unpack it and sort by its start node\n", "for (otype, ((start, end),)) in sorted(ranges.items(), key=lambda x: x[1][0][0]):\n", "    print('{:<15} from {:>7} to {:>7}'.format(otype, start, end))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a slightly different format:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0-426580\tword\n", "426581-514580\tclause\n", "514581-605142\tclause_atom\n", "605143-858316\tphrase\n", "858317-1125831\tphrase_atom\n", "1125832-1189401\tsentence\n", "1189402-1253740\tsentence_atom\n", "1253741-1367532\tsubphrase\n", "1367533-1367571\tbook\n", "1367572-1368500\tchapter\n", "1368501-1413680\thalf_verse\n", "1413681-1436893\tverse\n" ] } ], "source": [ "of = outfile('otype.tf')\n", "for (otype, ((start, end),)) in sorted(ranges.items(), key=lambda x: x[1][0][0]):\n", "    print('{}-{}\\t{}'.format(start, end, otype))\n", "    of.write('{}-{}\\t{}\\n'.format(start, end, otype))\n", "of.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Map monad numbers\n", "In TF we make sure that the monads are numbered consecutively from 0 to the maximum monad. So we have to map the original monad numbers\n", "to the node numbers of the words in TF." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# map each original monad number to the TF node number of its word\n", "tfFromMonad = {}\n", "for w in F.otype.s('word'):\n", "    m = int(F.monads.v(w))\n", "    tfFromMonad[m] = w\n", "of = outfile('tfFromMonad.csv')\n", "of.write('monad\\tTF\\n')\n", "for x in sorted(tfFromMonad.items()):\n", "    of.write('{}\\t{}\\n'.format(*x))\n", "of.close()" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def tfFromMonadList(mList): return [tfFromMonad[m] for m in mList]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## monads\n", "\n", "Here is code to write edge information in compact text files.\n", "We select a domain with a very limited number of nodes, and collect all edges between them." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The first word object is 0\n", "The first subphrase object is 1253741\n", "The first phrase_atom object is 858317\n", "The first phrase object is 605143\n", "The first clause_atom object is 514581\n", "The first clause object is 426581\n", "The first sentence_atom object is 1189402\n", "The first sentence object is 1125832\n", "The first half_verse object is 1368501\n", "The first verse object is 1413681\n", "The first chapter object is 1367572\n", "The first book object is 1367533\n" ] } ], "source": [ "first_object = {}\n", "for otype in otypes:\n", "    first_object[otype] = ranges[otype][0][0]\n", "    print('The first {} object is {}'.format(otype, first_object[otype]))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1367533\t1-28762\t\n", "1367534\t28763-52510\t\n", "1367572\t1-673\t\n", "1367573\t674-1167\t\n", "1413681\t1-11\t\n", "1413682\t12-31\t\n", "1368501\t1-4\t\n", "1368502\t5-11\t\n", 
"1125832\t1-11\t\n", "1125833\t12-18\t\n", "1189402\t1-11\t\n", "1189403\t12-18\t\n", "426581\t1-11\t\n", "426582\t12-18\t\n", "514581\t1-11\t\n", "514582\t12-18\t\n", "605143\t1-2\t\n", "605144\t3\t\n", "858317\t1-2\t\n", "858318\t3\t\n", "1253741\t5-7\t\n", "1253742\t9-11\t\n" ] } ], "source": [ "for otype in reversed(otypes[1:]):\n", " for n in range(first_object[otype], first_object[otype]+2):\n", " print('{}\\t{}\\t'.format(n, F.monads.v(n)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A few features\n", "\n", "First node features." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "nfeats = '''\n", " g_word_utf8\n", " trailer_utf8\n", " sp\n", "'''.strip().split()\n", "\n", "efeats = '''\n", " mother\n", " functional_parent\n", " distributional_parent\n", "'''.strip().split()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "[g_word_utf8]\n", "\n", "בְּ\n", "רֵאשִׁ֖ית\n", "בָּרָ֣א\n", "אֱלֹהִ֑ים\n", "אֵ֥ת\n", "הַ\n", "שָּׁמַ֖יִם\n", "וְ\n", "אֵ֥ת\n", "הָ\n", "אָֽרֶץ\n", "וְ\n", "הָ\n", "אָ֗רֶץ\n", "הָיְתָ֥ה\n", "תֹ֨הוּ֙\n", "וָ\n", "בֹ֔הוּ\n", "וְ\n", "חֹ֖שֶׁךְ\n", "\n", "[trailer_utf8]\n", "\n", "\n", "_\n", "_\n", "_\n", "_\n", "\n", "_\n", "\n", "_\n", "\n", "׃_\n", "\n", "\n", "_\n", "_\n", "_\n", "\n", "_\n", "\n", "_\n", "\n", "[sp]\n", "\n", "prep\n", "subs\n", "verb\n", "subs\n", "prep\n", "art\n", "subs\n", "conj\n", "prep\n", "art\n", "subs\n", "conj\n", "art\n", "subs\n", "verb\n", "subs\n", "conj\n", "subs\n", "conj\n", "subs\n" ] } ], "source": [ "def escape_ft(s):\n", " return s.replace('\\\\', '\\\\\\\\').replace('\\t', '\\\\t').replace('\\n', '\\\\n').replace(' ', '_')\n", "\n", "for feat in nfeats:\n", " print('\\n[{}]\\n'.format(feat))\n", " for n in range(20):\n", " print(escape_ft(F.item[feat].v(n)))" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "Then an edge feature: `mother`. We list all the edges involving the first eleven words, and the objects that surround them.\n", "We follow all edges and inverted edges until we reach end points.\n", "Exclude chains of `clause_atom`s." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Analyzing edge feature \"mother\"\n", "Starting with 30 nodes\n", "Iteration 30\n", " 0 unchecked nodes\n", " 30 checked nodes\n", "0\tword\n", "1\tword\n", "2\tword\n", "3\tword\n", "4\tword\n", "5\tword\n", "6\tword\n", "7\tword\n", "8\tword\n", "9\tword\n", "10\tword\n", "426581\tclause\n", "514581\tclause_atom\n", "605143\tphrase\n", "605144\tphrase\n", "605145\tphrase\n", "605146\tphrase\n", "858317\tphrase_atom\n", "858318\tphrase_atom\n", "858319\tphrase_atom\n", "858320\tphrase_atom\n", "1125832\tsentence\n", "1189402\tsentence_atom\n", "1253741\tsubphrase\n", "1253742\tsubphrase\n", "1367533\tbook\n", "1367572\tchapter\n", "1368501\thalf_verse\n", "1368502\thalf_verse\n", "1413681\tverse\n", "30 nodes\n", "Analyzing edge feature \"functional_parent\"\n", "Starting with 30 nodes\n", "Iteration 30\n", " 0 unchecked nodes\n", " 30 checked nodes\n", "0\tword\n", "1\tword\n", "2\tword\n", "3\tword\n", "4\tword\n", "5\tword\n", "6\tword\n", "7\tword\n", "8\tword\n", "9\tword\n", "10\tword\n", "426581\tclause\n", "514581\tclause_atom\n", "605143\tphrase\n", "605144\tphrase\n", "605145\tphrase\n", "605146\tphrase\n", "858317\tphrase_atom\n", "858318\tphrase_atom\n", "858319\tphrase_atom\n", "858320\tphrase_atom\n", "1125832\tsentence\n", "1189402\tsentence_atom\n", "1253741\tsubphrase\n", "1253742\tsubphrase\n", "1367533\tbook\n", "1367572\tchapter\n", "1368501\thalf_verse\n", "1368502\thalf_verse\n", "1413681\tverse\n", "30 nodes\n" ] } ], "source": [ "words = set(range(11))\n", "domain = {}\n", "\n", "for feat in efeats:\n", " print('Analyzing 
edge feature \"{}\"'.format(feat))\n", "    unchecked_nodes = set()\n", "    for w in words:\n", "        unchecked_nodes |= {w} | {L.u(otype, w) for otype in otypes[1:] if L.u(otype, w) is not None}\n", "    print('Starting with {} nodes'.format(len(unchecked_nodes)))\n", "    checked_nodes = set()\n", "    max_iter = 1000\n", "    i = 0\n", "    while len(unchecked_nodes):\n", "        i +=1\n", "        if i >= max_iter:\n", "            print('Iteration {:>3}\\n!Max iterations reached!\\n{:>3} unchecked nodes\\n{:>3} checked nodes'.format(\n", "                i,\n", "                len(unchecked_nodes),\n", "                len(checked_nodes),\n", "            ))\n", "            break\n", "        n = sorted(unchecked_nodes)[0]\n", "        # NB: edges are taken from `mother` here, whatever the value of `feat`;\n", "        # edges between nodes of the same type are skipped\n", "        to_edges = {m for m in C.mother.v(n) if F.otype.v(m) != F.otype.v(n)}\n", "        from_edges = {m for m in Ci.mother.v(n) if F.otype.v(m) != F.otype.v(n)}\n", "        checked_nodes.add(n)\n", "        unchecked_nodes.discard(n)\n", "        unchecked_nodes |= (to_edges | from_edges) - checked_nodes\n", "    print('Iteration {:>3}\\n{:>3} unchecked nodes\\n{:>3} checked nodes'.format(\n", "        i,\n", "        len(unchecked_nodes),\n", "        len(checked_nodes),\n", "    ))\n", "    # what have we got?\n", "    for n in sorted(checked_nodes):\n", "        print('{}\\t{}'.format(n, F.otype.v(n)))\n", "    domain[feat] = checked_nodes\n", "    print('{} nodes'.format(len(checked_nodes)))" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Showing edges for \"mother\"\n", "1253742\t1253741\t\n", "Showing edges for \"functional_parent\"\n", "0-1,858317\t605143\t\n", "2,858318\t605144\t\n", "3,858319\t605145\t\n", "4-10,858320\t605146\t\n", "426581,1189402\t1125832\t\n", "514581,605143-605146\t426581\t\n" ] } ], "source": [ "for feat in efeats:\n", "    print('Showing edges for \"{}\"'.format(feat))\n", "    condensed_end_points = collections.defaultdict(set)\n", "    for n in domain[feat]:\n", "        edges = set(C.item[feat].v(n))\n", "        if len(edges):\n", "            condensed_end_points[ranges_rep(convert_to_ranges(edges))].add(n)\n", "    condensed 
= collections.defaultdict(set)\n", "    for (end_points_rep, start_points) in condensed_end_points.items():\n", "        condensed[ranges_rep(convert_to_ranges(start_points))].add(end_points_rep)\n", "    for (start_points_rep, end_points_rep) in sorted(condensed.items()):\n", "        print('{}\\t{}\\t'.format(start_points_rep, ','.join(sorted(end_points_rep))))\n", "    " ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "## Export the monads edge feature compactly\n", "\n", "We already have a mapping from nodes to monad sets: the `monads` feature.\n", "We also want to collapse lines that assign the same monad set to several nodes.\n", "And, to be sure, we normalize all ranges we encounter, not assuming that they are already normalized.\n", "\n", "**NB** For the monads feature this is not the best way. We gain more space efficiency by listing for each node its\n", "associated monads on one line, because then we can leave out the node number.\n", "We implement that further below." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18m 04s Normalizing the monads feature\n", "18m 05s  100000 nodes\n", "18m 06s  200000 nodes\n", "18m 07s  300000 nodes\n", "18m 09s  400000 nodes\n", "18m 11s  500000 nodes\n", "18m 12s  600000 nodes\n", "18m 13s  700000 nodes\n", "18m 14s  800000 nodes\n", "18m 15s  900000 nodes\n", "18m 17s 1000000 nodes\n", "18m 17s 1010313 nodes map to  527633 monad sets\n" ] } ], "source": [ "normalized_monads = collections.defaultdict(set)\n", "chunksize = 100000\n", "i = 0\n", "j = 0\n", "inf('Normalizing the monads feature')\n", "for n in NN():\n", "    if F.otype.v(n) == 'word': continue\n", "    i += 1\n", "    j += 1\n", "    if j == chunksize:\n", "        j = 0\n", "        inf('{:>7} nodes'.format(i))\n", "    # key: the normalized monad set in TF numbering; value: all nodes that have that monad set\n", "    normalized_monads[\n", "        ranges_rep(convert_to_ranges(tfFromMonadList(ranges_to_list(ranges_from_text(F.monads.v(n))))))\n", "    ].add(n)\n", "inf('{:>7} nodes map to {:>7} monad sets'.format(i, len(normalized_monads)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now condense the list: nodes that share the same monad set are grouped into ranges." 
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18m 27s Condensing\n", "18m 33s Done\n" ] } ], "source": [ "inf('Condensing')\n", "of = outfile('monads_plain.tf')\n", "condensed_monads = collections.defaultdict(set)\n", "for (monads_rep, node_set) in normalized_monads.items():\n", "    condensed_monads[ranges_rep(convert_to_ranges(node_set))].add(monads_rep)\n", "for (node_set_rep, monad_set_rep) in sorted(condensed_monads.items()):\n", "    of.write('{}\\t{}\\t\\n'.format(node_set_rep, ','.join(sorted(monad_set_rep))))\n", "of.close()\n", "inf('Done')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is the other way of representing the monads edge feature: one line per node, in node order, with the node number written only on the first line." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1436893" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max(ranges[otype][-1][1] for otype in otypes[1:])" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "20m 39s Writing the monads feature\n", "20m 49s 1010313 nodes written\n" ] } ], "source": [ "inf('Writing the monads feature')\n", "of = outfile('monads.tf')\n", "i = 0\n", "start_node = ranges[otypes[0]][-1][1] + 1 # one more than the last word\n", "end_node = max(ranges[otype][-1][1] for otype in otypes[1:])\n", "\n", "# only the first line carries a node number; the following lines are implicitly consecutive\n", "of.write('{}\\t'.format(start_node))\n", "for n in range(start_node, end_node+1):\n", "    i+=1\n", "    of.write('{}\\t\\n'.format(\n", "        ranges_rep(convert_to_ranges(tfFromMonadList(ranges_to_list(ranges_from_text(F.monads.v(n))))))\n", "    ))\n", "of.close()\n", "inf('{:>7} nodes written'.format(i))" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"55m 24s Loading monads\n", "55m 31s Monads loaded monad specs for 1010313 nodes\n" ] } ], "source": [ "inf('Loading monads')\n", "fl = infile('monads.tf')\n", "monads_intern = collections.defaultdict(set)\n", "for line in fl:\n", " (nodes, monads, label) = line.rstrip('\\n').split('\\t')\n", " node_list = ranges_to_list(ranges_from_text(nodes))\n", " monads_list = ranges_to_list(ranges_from_text(monads))\n", " for n in node_list:\n", " monads_intern[n] |= set(monads_list)\n", "fl.close()\n", "inf('Monads loaded monad specs for {} nodes'.format(len(monads_intern)))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }