{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"id": "c32bf0b9-1445-4ede-ae49-7dd63ff3b08e",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:52.002602Z",
"iopub.status.busy": "2024-01-17T01:35:52.001643Z",
"iopub.status.idle": "2024-01-17T01:35:52.021332Z",
"shell.execute_reply": "2024-01-17T01:35:52.018806Z",
"shell.execute_reply.started": "2024-01-17T01:35:52.002544Z"
}
},
"outputs": [],
"source": [
"# for use in tutorial and development; do not include this `sys.path` change in production:\n",
"import sys ; sys.path.insert(0, \"../\")"
]
},
{
"cell_type": "markdown",
"id": "c8ff5d81-110c-42ae-8aa7-ed4fffea40c6",
"metadata": {},
"source": [
"# bootstrap the _lemma graph_ with RDF triples"
]
},
{
"cell_type": "markdown",
"id": "1e847d0a-bc6c-470a-9fef-620ebbdbbbc3",
"metadata": {},
"source": [
"This notebook shows how to bootstrap definitions in a _lemma graph_ by loading RDF triples, for example to declare synonyms."
]
},
{
"cell_type": "markdown",
"id": "61d8d39a-23e4-48e7-b8f4-0dd724ccf586",
"metadata": {},
"source": [
"## environment"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "22489527-2ad5-4e3c-be23-f511e6bcf69f",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:52.030355Z",
"iopub.status.busy": "2024-01-17T01:35:52.029702Z",
"iopub.status.idle": "2024-01-17T01:35:59.577245Z",
"shell.execute_reply": "2024-01-17T01:35:59.576046Z",
"shell.execute_reply.started": "2024-01-17T01:35:52.030319Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"from icecream import ic\n",
"from pyinstrument import Profiler\n",
"import pyvis\n",
"\n",
"import textgraphs"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "438f5775-487b-493e-a172-59b652b94955",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.579567Z",
"iopub.status.busy": "2024-01-17T01:35:59.579060Z",
"iopub.status.idle": "2024-01-17T01:35:59.603599Z",
"shell.execute_reply": "2024-01-17T01:35:59.602072Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.579536Z"
}
},
"outputs": [],
"source": [
"%load_ext watermark"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "adc052dd-5cca-4d11-b543-3f0999f4f883",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.605959Z",
"iopub.status.busy": "2024-01-17T01:35:59.605459Z",
"iopub.status.idle": "2024-01-17T01:35:59.655730Z",
"shell.execute_reply": "2024-01-17T01:35:59.654417Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.605924Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last updated: 2024-01-16T17:35:59.608787-08:00\n",
"\n",
"Python implementation: CPython\n",
"Python version : 3.10.11\n",
"IPython version : 8.20.0\n",
"\n",
"Compiler : Clang 13.0.0 (clang-1300.0.29.30)\n",
"OS : Darwin\n",
"Release : 21.6.0\n",
"Machine : x86_64\n",
"Processor : i386\n",
"CPU cores : 8\n",
"Architecture: 64bit\n",
"\n"
]
}
],
"source": [
"%watermark"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "6e4618da-daf9-44c9-adbb-e5781dba5504",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.658604Z",
"iopub.status.busy": "2024-01-17T01:35:59.658083Z",
"iopub.status.idle": "2024-01-17T01:35:59.692941Z",
"shell.execute_reply": "2024-01-17T01:35:59.684789Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.658572Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"pyvis : 0.3.2\n",
"textgraphs: 0.5.0\n",
"sys : 3.10.11 (v3.10.11:7d4cc5aa85, Apr 4 2023, 19:05:19) [Clang 13.0.0 (clang-1300.0.29.30)]\n",
"\n"
]
}
],
"source": [
"%watermark --iversions"
]
},
{
"cell_type": "markdown",
"id": "23cefb5b-6ee7-4c33-8f82-a526cb9125d8",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-15T00:46:26.663615Z",
"iopub.status.busy": "2024-01-15T00:46:26.662220Z",
"iopub.status.idle": "2024-01-15T00:46:26.673766Z",
"shell.execute_reply": "2024-01-15T00:46:26.672702Z",
"shell.execute_reply.started": "2024-01-15T00:46:26.663477Z"
}
},
"source": [
"## load the bootstrap definitions"
]
},
{
"cell_type": "markdown",
"id": "89da700d-1e7f-4b24-901f-a36db8525add",
"metadata": {},
"source": [
"Define the bootstrap RDF triples in N3/Turtle format. Here the entity `Werner` is declared as a synonym for `Werner Herzog` through the [`skos:broader`](https://www.w3.org/TR/skos-reference/#semantic-relations) relation. Keep in mind that `Werner` may also refer to other people named Werner."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e2412f6c-2c60-40d7-95f5-7bd281d522e7",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.695180Z",
"iopub.status.busy": "2024-01-17T01:35:59.694887Z",
"iopub.status.idle": "2024-01-17T01:35:59.711557Z",
"shell.execute_reply": "2024-01-17T01:35:59.704654Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.695127Z"
}
},
"outputs": [],
"source": [
"TTL_STR: str = \"\"\"\n",
"@base .\n",
"@prefix dbo: .\n",
"@prefix skos: .\n",
"\n",
" a dbo:Person ;\n",
" skos:prefLabel \"Werner\"@en .\n",
"\n",
" a dbo:Person ;\n",
" skos:prefLabel \"Werner Herzog\"@en.\n",
"\n",
"dbo:Person skos:definition \"People, including fictional\"@en ;\n",
" skos:prefLabel \"person\"@en .\n",
"\n",
" skos:broader .\n",
"\"\"\""
]
},
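{
"cell_type": "markdown",
"id": "b7e41c02-skos-broader-sketch",
"metadata": {},
"source": [
"The role of `skos:broader` in the definitions above can be sketched in plain Python. This is a minimal illustration, not how `textgraphs` loads RDF: the `ex:` identifiers and the `broader_links` helper are hypothetical, and the direction of the link (the more specific `Werner Herzog` pointing to the broader `Werner`) is an assumption based on SKOS semantics.\n",
"\n",
"```python\n",
"# Hypothetical identifiers for illustration only; not the IRIs used by textgraphs.\n",
"TRIPLES = [\n",
"    ('ex:werner', 'skos:prefLabel', 'Werner'),\n",
"    ('ex:werner_herzog', 'skos:prefLabel', 'Werner Herzog'),\n",
"    # assumption: the more specific entity points to the broader one\n",
"    ('ex:werner_herzog', 'skos:broader', 'ex:werner'),\n",
"]\n",
"\n",
"def broader_links(triples):\n",
"    # map each subject to its preferred label\n",
"    labels = {s: o for s, p, o in triples if p == 'skos:prefLabel'}\n",
"    # pair the label of each narrower entity with the label of its broader entity\n",
"    return [(labels[s], labels[o]) for s, p, o in triples if p == 'skos:broader']\n",
"\n",
"print(broader_links(TRIPLES))\n",
"```\n",
"\n",
"Each returned pair gives a more specific label followed by the broader label under which the same entity may be mentioned in the text."
]
},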
{
"cell_type": "markdown",
"id": "7c567afd-2f44-4391-899a-da6aba3d222e",
"metadata": {},
"source": [
"provide the source text"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "630430c5-21dc-4897-9a4b-3b01baf3de17",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.718153Z",
"iopub.status.busy": "2024-01-17T01:35:59.717788Z",
"iopub.status.idle": "2024-01-17T01:35:59.734747Z",
"shell.execute_reply": "2024-01-17T01:35:59.732341Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.718117Z"
}
},
"outputs": [],
"source": [
"SRC_TEXT: str = \"\"\"\n",
"Werner Herzog is a remarkable filmmaker and an intellectual originally from Germany, the son of Dietrich Herzog.\n",
"After the war, Werner fled to America to become famous.\n",
"\"\"\""
]
},
{
"cell_type": "markdown",
"id": "01152885-f301-49b1-ab61-f5b19d81c036",
"metadata": {},
"source": [
"set up the statistical stack profiling"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2a289117-301d-4027-ae1b-200201fb5f93",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.738759Z",
"iopub.status.busy": "2024-01-17T01:35:59.737750Z",
"iopub.status.idle": "2024-01-17T01:35:59.745742Z",
"shell.execute_reply": "2024-01-17T01:35:59.744107Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.738713Z"
}
},
"outputs": [],
"source": [
"profiler: Profiler = Profiler()\n",
"profiler.start()"
]
},
{
"cell_type": "markdown",
"id": "bf9d4f99-b82b-4d11-a9a4-31d0337f4aa8",
"metadata": {},
"source": [
"set up the `TextGraphs` pipeline"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "da6fcb0f-b2ac-4f74-af39-2c129c750cab",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:35:59.749862Z",
"iopub.status.busy": "2024-01-17T01:35:59.749122Z",
"iopub.status.idle": "2024-01-17T01:36:03.042323Z",
"shell.execute_reply": "2024-01-17T01:36:03.040676Z",
"shell.execute_reply.started": "2024-01-17T01:35:59.749790Z"
}
},
"outputs": [],
"source": [
"tg: textgraphs.TextGraphs = textgraphs.TextGraphs(\n",
"    factory = textgraphs.PipelineFactory(\n",
"        kg = textgraphs.KGWikiMedia(\n",
"            spotlight_api = textgraphs.DBPEDIA_SPOTLIGHT_API,\n",
"            dbpedia_search_api = textgraphs.DBPEDIA_SEARCH_API,\n",
"            dbpedia_sparql_api = textgraphs.DBPEDIA_SPARQL_API,\n",
"            wikidata_api = textgraphs.WIKIDATA_API,\n",
"            min_alias = textgraphs.DBPEDIA_MIN_ALIAS,\n",
"            min_similarity = textgraphs.DBPEDIA_MIN_SIM,\n",
"        ),\n",
"    ),\n",
")"
]
},
{
"cell_type": "markdown",
"id": "e6f98bbc-6954-4e39-b5d6-f726816bd5c7",
"metadata": {},
"source": [
"load the bootstrap definitions"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "321a9a90-ae80-47d7-b392-020b06bd3066",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.044027Z",
"iopub.status.busy": "2024-01-17T01:36:03.043746Z",
"iopub.status.idle": "2024-01-17T01:36:03.071058Z",
"shell.execute_reply": "2024-01-17T01:36:03.070258Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.043990Z"
}
},
"outputs": [],
"source": [
"tg.load_bootstrap_ttl(\n",
"    TTL_STR,\n",
"    debug = False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1db1fe56-52fe-4a01-9776-82908444dd6c",
"metadata": {},
"source": [
"parse the input text"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "f7f6665e-19da-4a25-a405-adbb5dfb3e88",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.072882Z",
"iopub.status.busy": "2024-01-17T01:36:03.072607Z",
"iopub.status.idle": "2024-01-17T01:36:03.751536Z",
"shell.execute_reply": "2024-01-17T01:36:03.750042Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.072843Z"
}
},
"outputs": [],
"source": [
"pipe: textgraphs.Pipeline = tg.create_pipeline(\n",
"    SRC_TEXT.strip(),\n",
")\n",
"\n",
"tg.collect_graph_elements(\n",
"    pipe,\n",
"    debug = False,\n",
")\n",
"\n",
"tg.construct_lemma_graph(\n",
"    debug = False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "3143955c-446a-4e6c-834c-583ab173f446",
"metadata": {},
"source": [
"## visualize the lemma graph"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "05b409af-14df-4158-9709-ffe2d79e864b",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.762865Z",
"iopub.status.busy": "2024-01-17T01:36:03.762378Z",
"iopub.status.idle": "2024-01-17T01:36:03.773217Z",
"shell.execute_reply": "2024-01-17T01:36:03.769536Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.762817Z"
},
"scrolled": true
},
"outputs": [],
"source": [
"render: textgraphs.RenderPyVis = tg.create_render()\n",
"\n",
"pv_graph: pyvis.network.Network = render.render_lemma_graph(\n",
"    debug = False,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "7b5d3e88-6669-4df1-a20a-587cc6a7db12",
"metadata": {},
"source": [
"initialize the layout parameters"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "b212f5ed-03d6-439f-92ae-f2cbedb18609",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.776399Z",
"iopub.status.busy": "2024-01-17T01:36:03.775428Z",
"iopub.status.idle": "2024-01-17T01:36:03.784525Z",
"shell.execute_reply": "2024-01-17T01:36:03.783464Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.776310Z"
}
},
"outputs": [],
"source": [
"pv_graph.force_atlas_2based(\n",
"    gravity = -38,\n",
"    central_gravity = 0.01,\n",
"    spring_length = 231,\n",
"    spring_strength = 0.7,\n",
"    damping = 0.8,\n",
"    overlap = 0,\n",
")\n",
"\n",
"pv_graph.show_buttons(filter_ = [ \"physics\" ])\n",
"pv_graph.toggle_physics(True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2f952a7c-3130-49c9-b659-fb941e9e0bfe",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.788862Z",
"iopub.status.busy": "2024-01-17T01:36:03.787641Z",
"iopub.status.idle": "2024-01-17T01:36:03.848366Z",
"shell.execute_reply": "2024-01-17T01:36:03.847499Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.788773Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tmp.fig04.html\n"
]
},
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pv_graph.prep_notebook()\n",
"pv_graph.show(\"tmp.fig04.html\")"
]
},
{
"cell_type": "markdown",
"id": "e57d42a8-4414-4f27-9817-b9339e65346f",
"metadata": {},
"source": [
"Notice that the `Werner` and `Werner Herzog` nodes are now linked. The synonym supplied by the bootstrap definitions above connects more portions of the _lemma graph_ than the demo in `ex0_0` does with the same input text."
]
},
{
"cell_type": "markdown",
"id": "ff49fe28-e75f-4590-8b87-0d8962928cba",
"metadata": {},
"source": [
"## statistical stack profile instrumentation"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "af4ecb06-370f-4077-9899-29a1673e4768",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.849937Z",
"iopub.status.busy": "2024-01-17T01:36:03.849635Z",
"iopub.status.idle": "2024-01-17T01:36:03.856645Z",
"shell.execute_reply": "2024-01-17T01:36:03.855799Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.849877Z"
}
},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"profiler.stop()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "d5ac2ce6-15b1-41ad-8215-8a5f76036cf1",
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-17T01:36:03.857987Z",
"iopub.status.busy": "2024-01-17T01:36:03.857704Z",
"iopub.status.idle": "2024-01-17T01:36:04.615855Z",
"shell.execute_reply": "2024-01-17T01:36:04.615084Z",
"shell.execute_reply.started": "2024-01-17T01:36:03.857962Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" _ ._ __/__ _ _ _ _ _/_ Recorded: 17:35:59 Samples: 2846\n",
" /_//_/// /_\\ / //_// / //_'/ // Duration: 4.111 CPU time: 3.294\n",
"/ _/ v4.6.1\n",
"\n",
"Program: /Users/paco/src/textgraphs/venv/lib/python3.10/site-packages/ipykernel_launcher.py -f /Users/paco/Library/Jupyter/runtime/kernel-4365d4ba-2d4d-4d4b-83e2-eb5ef8abfe26.json\n",
"\n",
"4.111 IPythonKernel.dispatch_shell ipykernel/kernelbase.py:378\n",
"└─ 4.075 IPythonKernel.execute_request ipykernel/kernelbase.py:721\n",
" [9 frames hidden] ipykernel, IPython\n",
" 3.995 ZMQInteractiveShell.run_ast_nodes IPython/core/interactiveshell.py:3394\n",
" ├─ 3.250 ../ipykernel_4433/1372904243.py:1\n",
" │ └─ 3.248 PipelineFactory.__init__ textgraphs/pipe.py:434\n",
" │ └─ 3.232 load spacy/__init__.py:27\n",
" │ [98 frames hidden] spacy, en_core_web_sm, catalogue, imp...\n",
" │ 0.496 tokenizer_factory spacy/language.py:110\n",
" │ └─ 0.108 _validate_special_case spacy/tokenizer.pyx:573\n",
" │ 0.439 spacy/language.py:2170\n",
" │ └─ 0.085 _validate_special_case spacy/tokenizer.pyx:573\n",
" ├─ 0.672 ../ipykernel_4433/3257668275.py:1\n",
" │ └─ 0.669 TextGraphs.create_pipeline textgraphs/doc.py:103\n",
" │ └─ 0.669 PipelineFactory.create_pipeline textgraphs/pipe.py:508\n",
" │ └─ 0.669 Pipeline.__init__ textgraphs/pipe.py:216\n",
" │ └─ 0.669 English.__call__ spacy/language.py:1016\n",
" │ [31 frames hidden] spacy, spacy_dbpedia_spotlight, reque...\n",
" └─ 0.055 ../ipykernel_4433/72966960.py:1\n",
" └─ 0.046 Network.prep_notebook pyvis/network.py:552\n",
" [5 frames hidden] pyvis, jinja2\n",
"\n",
"\n"
]
}
],
"source": [
"profiler.print()"
]
},
{
"cell_type": "markdown",
"id": "c47bcfd2-2bd6-49a5-8f1a-102d90edde39",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"## outro"
]
},
{
"cell_type": "markdown",
"id": "68bea4f9-aec2-4b28-8f08-a4034851d066",
"metadata": {},
"source": [
"_\\[ more parts are in progress and will be added to this demo \\]_"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.11"
}
},
"nbformat": 4,
"nbformat_minor": 5
}