{ "cells": [ { "cell_type": "code", "execution_count": 1, "id": "c32bf0b9-1445-4ede-ae49-7dd63ff3b08e", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:52.002602Z", "iopub.status.busy": "2024-01-17T01:35:52.001643Z", "iopub.status.idle": "2024-01-17T01:35:52.021332Z", "shell.execute_reply": "2024-01-17T01:35:52.018806Z", "shell.execute_reply.started": "2024-01-17T01:35:52.002544Z" } }, "outputs": [], "source": [ "# for use in tutorial and development; do not include this `sys.path` change in production:\n", "import sys ; sys.path.insert(0, \"../\")" ] }, { "cell_type": "markdown", "id": "c8ff5d81-110c-42ae-8aa7-ed4fffea40c6", "metadata": {}, "source": [ "# bootstrap the _lemma graph_ with RDF triples" ] }, { "cell_type": "markdown", "id": "1e847d0a-bc6c-470a-9fef-620ebbdbbbc3", "metadata": {}, "source": [ "Show how to bootstrap definitions in a _lemma graph_ by loading RDF, e.g., for synonyms." ] }, { "cell_type": "markdown", "id": "61d8d39a-23e4-48e7-b8f4-0dd724ccf586", "metadata": {}, "source": [ "## environment" ] }, { "cell_type": "code", "execution_count": 2, "id": "22489527-2ad5-4e3c-be23-f511e6bcf69f", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:52.030355Z", "iopub.status.busy": "2024-01-17T01:35:52.029702Z", "iopub.status.idle": "2024-01-17T01:35:59.577245Z", "shell.execute_reply": "2024-01-17T01:35:59.576046Z", "shell.execute_reply.started": "2024-01-17T01:35:52.030319Z" }, "scrolled": true }, "outputs": [], "source": [ "from icecream import ic\n", "from pyinstrument import Profiler\n", "import pyvis\n", "\n", "import textgraphs" ] }, { "cell_type": "code", "execution_count": 3, "id": "438f5775-487b-493e-a172-59b652b94955", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.579567Z", "iopub.status.busy": "2024-01-17T01:35:59.579060Z", "iopub.status.idle": "2024-01-17T01:35:59.603599Z", "shell.execute_reply": "2024-01-17T01:35:59.602072Z", "shell.execute_reply.started": 
"2024-01-17T01:35:59.579536Z" } }, "outputs": [], "source": [ "%load_ext watermark" ] }, { "cell_type": "code", "execution_count": 4, "id": "adc052dd-5cca-4d11-b543-3f0999f4f883", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.605959Z", "iopub.status.busy": "2024-01-17T01:35:59.605459Z", "iopub.status.idle": "2024-01-17T01:35:59.655730Z", "shell.execute_reply": "2024-01-17T01:35:59.654417Z", "shell.execute_reply.started": "2024-01-17T01:35:59.605924Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Last updated: 2024-01-16T17:35:59.608787-08:00\n", "\n", "Python implementation: CPython\n", "Python version : 3.10.11\n", "IPython version : 8.20.0\n", "\n", "Compiler : Clang 13.0.0 (clang-1300.0.29.30)\n", "OS : Darwin\n", "Release : 21.6.0\n", "Machine : x86_64\n", "Processor : i386\n", "CPU cores : 8\n", "Architecture: 64bit\n", "\n" ] } ], "source": [ "%watermark" ] }, { "cell_type": "code", "execution_count": 5, "id": "6e4618da-daf9-44c9-adbb-e5781dba5504", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.658604Z", "iopub.status.busy": "2024-01-17T01:35:59.658083Z", "iopub.status.idle": "2024-01-17T01:35:59.692941Z", "shell.execute_reply": "2024-01-17T01:35:59.684789Z", "shell.execute_reply.started": "2024-01-17T01:35:59.658572Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pyvis : 0.3.2\n", "textgraphs: 0.5.0\n", "sys : 3.10.11 (v3.10.11:7d4cc5aa85, Apr 4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)]\n", "\n" ] } ], "source": [ "%watermark --iversions" ] }, { "cell_type": "markdown", "id": "23cefb5b-6ee7-4c33-8f82-a526cb9125d8", "metadata": { "execution": { "iopub.execute_input": "2024-01-15T00:46:26.663615Z", "iopub.status.busy": "2024-01-15T00:46:26.662220Z", "iopub.status.idle": "2024-01-15T00:46:26.673766Z", "shell.execute_reply": "2024-01-15T00:46:26.672702Z", "shell.execute_reply.started": "2024-01-15T00:46:26.663477Z" } }, "source": [ "## load the 
bootstrap definitions" ] }, { "cell_type": "markdown", "id": "89da700d-1e7f-4b24-901f-a36db8525add", "metadata": {}, "source": [ "Define the bootstrap RDF triples in N3/Turtle format. Here the entity `Werner` is declared as a synonym for `Werner Herzog` by linking the two with the [`skos:broader`](https://www.w3.org/TR/skos-reference/#semantic-relations) relation. Keep in mind that the `Werner` entity may also refer to other people named Werner." ] }, { "cell_type": "code", "execution_count": 6, "id": "e2412f6c-2c60-40d7-95f5-7bd281d522e7", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.695180Z", "iopub.status.busy": "2024-01-17T01:35:59.694887Z", "iopub.status.idle": "2024-01-17T01:35:59.711557Z", "shell.execute_reply": "2024-01-17T01:35:59.704654Z", "shell.execute_reply.started": "2024-01-17T01:35:59.695127Z" } }, "outputs": [], "source": [ "TTL_STR: str = \"\"\"\n", "@base .\n", "@prefix dbo: <http://dbpedia.org/ontology/> .\n", "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n", "\n", " a dbo:Person ;\n", "    skos:prefLabel \"Werner\"@en .\n", "\n", " a dbo:Person ;\n", "    skos:prefLabel \"Werner Herzog\"@en .\n", "\n", "dbo:Person skos:definition \"People, including fictional\"@en ;\n", "    skos:prefLabel \"person\"@en .\n", "\n", " skos:broader .\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "7c567afd-2f44-4391-899a-da6aba3d222e", "metadata": {}, "source": [ "provide the source text" ] }, { "cell_type": "code", "execution_count": 7, "id": "630430c5-21dc-4897-9a4b-3b01baf3de17", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.718153Z", "iopub.status.busy": "2024-01-17T01:35:59.717788Z", "iopub.status.idle": "2024-01-17T01:35:59.734747Z", "shell.execute_reply": "2024-01-17T01:35:59.732341Z", "shell.execute_reply.started": "2024-01-17T01:35:59.718117Z" } }, "outputs": [], "source": [ "SRC_TEXT: str = \"\"\" \n", "Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.\n", "After the war, Werner fled to America to become famous.\n", "\"\"\"" ] }, { 
"cell_type": "markdown", "id": "01152885-f301-49b1-ab61-f5b19d81c036", "metadata": {}, "source": [ "set up the statistical stack profiling" ] }, { "cell_type": "code", "execution_count": 8, "id": "2a289117-301d-4027-ae1b-200201fb5f93", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.738759Z", "iopub.status.busy": "2024-01-17T01:35:59.737750Z", "iopub.status.idle": "2024-01-17T01:35:59.745742Z", "shell.execute_reply": "2024-01-17T01:35:59.744107Z", "shell.execute_reply.started": "2024-01-17T01:35:59.738713Z" } }, "outputs": [], "source": [ "profiler: Profiler = Profiler()\n", "profiler.start()" ] }, { "cell_type": "markdown", "id": "bf9d4f99-b82b-4d11-a9a4-31d0337f4aa8", "metadata": {}, "source": [ "set up the `TextGraphs` pipeline" ] }, { "cell_type": "code", "execution_count": 9, "id": "da6fcb0f-b2ac-4f74-af39-2c129c750cab", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:35:59.749862Z", "iopub.status.busy": "2024-01-17T01:35:59.749122Z", "iopub.status.idle": "2024-01-17T01:36:03.042323Z", "shell.execute_reply": "2024-01-17T01:36:03.040676Z", "shell.execute_reply.started": "2024-01-17T01:35:59.749790Z" } }, "outputs": [], "source": [ "tg: textgraphs.TextGraphs = textgraphs.TextGraphs(\n", "    factory = textgraphs.PipelineFactory(\n", "        kg = textgraphs.KGWikiMedia(\n", "            spotlight_api = textgraphs.DBPEDIA_SPOTLIGHT_API,\n", "            dbpedia_search_api = textgraphs.DBPEDIA_SEARCH_API,\n", "            dbpedia_sparql_api = textgraphs.DBPEDIA_SPARQL_API,\n", "            wikidata_api = textgraphs.WIKIDATA_API,\n", "            min_alias = textgraphs.DBPEDIA_MIN_ALIAS,\n", "            min_similarity = textgraphs.DBPEDIA_MIN_SIM,\n", "        ),\n", "    ),\n", ")" ] }, { "cell_type": "markdown", "id": "e6f98bbc-6954-4e39-b5d6-f726816bd5c7", "metadata": {}, "source": [ "load the bootstrap definitions" ] }, { "cell_type": "code", "execution_count": 10, "id": "321a9a90-ae80-47d7-b392-020b06bd3066", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.044027Z", 
"iopub.status.busy": "2024-01-17T01:36:03.043746Z", "iopub.status.idle": "2024-01-17T01:36:03.071058Z", "shell.execute_reply": "2024-01-17T01:36:03.070258Z", "shell.execute_reply.started": "2024-01-17T01:36:03.043990Z" } }, "outputs": [], "source": [ "tg.load_bootstrap_ttl(\n", "    TTL_STR,\n", "    debug = False,\n", ")" ] }, { "cell_type": "markdown", "id": "1db1fe56-52fe-4a01-9776-82908444dd6c", "metadata": {}, "source": [ "parse the input text" ] }, { "cell_type": "code", "execution_count": 11, "id": "f7f6665e-19da-4a25-a405-adbb5dfb3e88", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.072882Z", "iopub.status.busy": "2024-01-17T01:36:03.072607Z", "iopub.status.idle": "2024-01-17T01:36:03.751536Z", "shell.execute_reply": "2024-01-17T01:36:03.750042Z", "shell.execute_reply.started": "2024-01-17T01:36:03.072843Z" } }, "outputs": [], "source": [ "pipe: textgraphs.Pipeline = tg.create_pipeline(\n", "    SRC_TEXT.strip(),\n", ")\n", "\n", "tg.collect_graph_elements(\n", "    pipe,\n", "    debug = False,\n", ")\n", "\n", "tg.construct_lemma_graph(\n", "    debug = False,\n", ")" ] }, { "cell_type": "markdown", "id": "3143955c-446a-4e6c-834c-583ab173f446", "metadata": {}, "source": [ "## visualize the lemma graph" ] }, { "cell_type": "code", "execution_count": 12, "id": "05b409af-14df-4158-9709-ffe2d79e864b", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.762865Z", "iopub.status.busy": "2024-01-17T01:36:03.762378Z", "iopub.status.idle": "2024-01-17T01:36:03.773217Z", "shell.execute_reply": "2024-01-17T01:36:03.769536Z", "shell.execute_reply.started": "2024-01-17T01:36:03.762817Z" }, "scrolled": true }, "outputs": [], "source": [ "render: textgraphs.RenderPyVis = tg.create_render()\n", "\n", "pv_graph: pyvis.network.Network = render.render_lemma_graph(\n", "    debug = False,\n", ")" ] }, { "cell_type": "markdown", "id": "7b5d3e88-6669-4df1-a20a-587cc6a7db12", "metadata": {}, "source": [ "initialize the layout parameters" ] }, { 
"cell_type": "code", "execution_count": 13, "id": "b212f5ed-03d6-439f-92ae-f2cbedb18609", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.776399Z", "iopub.status.busy": "2024-01-17T01:36:03.775428Z", "iopub.status.idle": "2024-01-17T01:36:03.784525Z", "shell.execute_reply": "2024-01-17T01:36:03.783464Z", "shell.execute_reply.started": "2024-01-17T01:36:03.776310Z" } }, "outputs": [], "source": [ "pv_graph.force_atlas_2based(\n", "    gravity = -38,\n", "    central_gravity = 0.01,\n", "    spring_length = 231,\n", "    spring_strength = 0.7,\n", "    damping = 0.8,\n", "    overlap = 0,\n", ")\n", "\n", "pv_graph.show_buttons(filter_ = [ \"physics\" ])\n", "pv_graph.toggle_physics(True)" ] }, { "cell_type": "code", "execution_count": 14, "id": "2f952a7c-3130-49c9-b659-fb941e9e0bfe", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.788862Z", "iopub.status.busy": "2024-01-17T01:36:03.787641Z", "iopub.status.idle": "2024-01-17T01:36:03.848366Z", "shell.execute_reply": "2024-01-17T01:36:03.847499Z", "shell.execute_reply.started": "2024-01-17T01:36:03.788773Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tmp.fig04.html\n" ] }, { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pv_graph.prep_notebook()\n", "pv_graph.show(\"tmp.fig04.html\")" ] }, { "cell_type": "markdown", "id": "e57d42a8-4414-4f27-9817-b9339e65346f", "metadata": {}, "source": [ "Notice how the `Werner` and `Werner Herzog` nodes are now linked. The synonym from the bootstrap definitions above provides a means to connect more portions of the _lemma graph_ than the `ex0_0` demo does with the same input text." 
] }, { "cell_type": "markdown", "id": "ff49fe28-e75f-4590-8b87-0d8962928cba", "metadata": {}, "source": [ "## statistical stack profile instrumentation" ] }, { "cell_type": "code", "execution_count": 15, "id": "af4ecb06-370f-4077-9899-29a1673e4768", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.849937Z", "iopub.status.busy": "2024-01-17T01:36:03.849635Z", "iopub.status.idle": "2024-01-17T01:36:03.856645Z", "shell.execute_reply": "2024-01-17T01:36:03.855799Z", "shell.execute_reply.started": "2024-01-17T01:36:03.849877Z" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "profiler.stop()" ] }, { "cell_type": "code", "execution_count": 16, "id": "d5ac2ce6-15b1-41ad-8215-8a5f76036cf1", "metadata": { "execution": { "iopub.execute_input": "2024-01-17T01:36:03.857987Z", "iopub.status.busy": "2024-01-17T01:36:03.857704Z", "iopub.status.idle": "2024-01-17T01:36:04.615855Z", "shell.execute_reply": "2024-01-17T01:36:04.615084Z", "shell.execute_reply.started": "2024-01-17T01:36:03.857962Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " _ ._ __/__ _ _ _ _ _/_ Recorded: 17:35:59 Samples: 2846\n", " /_//_/// /_\\ / //_// / //_'/ // Duration: 4.111 CPU time: 3.294\n", "/ _/ v4.6.1\n", "\n", "Program: /Users/paco/src/textgraphs/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f /Users/paco/Library/Jupyter/runtime/kernel-4365d4ba-2d4d-4d4b-83e2-eb5ef8abfe26.json\n", "\n", "4.111 IPythonKernel.dispatch_shell ipykernel/kernelbase.py:378\n", "└─ 4.075 IPythonKernel.execute_request ipykernel/kernelbase.py:721\n", " [9 frames hidden] ipykernel, IPython\n", " 3.995 ZMQInteractiveShell.run_ast_nodes IPython/core/interactiveshell.py:3394\n", " ├─ 3.250 ../ipykernel_4433/1372904243.py:1\n", " │ └─ 3.248 PipelineFactory.__init__ textgraphs/pipe.py:434\n", " │ └─ 3.232 load spacy/__init__.py:27\n", " │ [98 frames hidden] spacy, 
en_core_web_sm, catalogue, imp...\n", " │ 0.496 tokenizer_factory spacy/language.py:110\n", " │ └─ 0.108 _validate_special_case spacy/tokenizer.pyx:573\n", " │ 0.439 spacy/language.py:2170\n", " │ └─ 0.085 _validate_special_case spacy/tokenizer.pyx:573\n", " ├─ 0.672 ../ipykernel_4433/3257668275.py:1\n", " │ └─ 0.669 TextGraphs.create_pipeline textgraphs/doc.py:103\n", " │ └─ 0.669 PipelineFactory.create_pipeline textgraphs/pipe.py:508\n", " │ └─ 0.669 Pipeline.__init__ textgraphs/pipe.py:216\n", " │ └─ 0.669 English.__call__ spacy/language.py:1016\n", " │ [31 frames hidden] spacy, spacy_dbpedia_spotlight, reque...\n", " └─ 0.055 ../ipykernel_4433/72966960.py:1\n", " └─ 0.046 Network.prep_notebook pyvis/network.py:552\n", " [5 frames hidden] pyvis, jinja2\n", "\n", "\n" ] } ], "source": [ "profiler.print()" ] }, { "cell_type": "markdown", "id": "c47bcfd2-2bd6-49a5-8f1a-102d90edde39", "metadata": { "jp-MarkdownHeadingCollapsed": true }, "source": [ "## outro" ] }, { "cell_type": "markdown", "id": "68bea4f9-aec2-4b28-8f08-a4034851d066", "metadata": {}, "source": [ "_\\[ more parts are in progress, getting added to this demo \\]_" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" } }, "nbformat": 4, "nbformat_minor": 5 }