abdullahmubeen10
committed on
Upload 5 files
Browse files- .streamlit/config.toml +3 -0
- Demo.py +129 -0
- Dockerfile +72 -0
- pages/Workflow & Model Overview.py +180 -0
- requirements.txt +7 -0
.streamlit/config.toml
ADDED
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
1 |
+
[theme]
|
2 |
+
base="light"
|
3 |
+
primaryColor="#29B4E8"
|
Demo.py
ADDED
@@ -0,0 +1,129 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
import sparknlp
|
3 |
+
|
4 |
+
from sparknlp.base import *
|
5 |
+
from sparknlp.annotator import *
|
6 |
+
from pyspark.ml import Pipeline
|
7 |
+
|
8 |
+
# Page configuration
|
9 |
+
st.set_page_config(
|
10 |
+
layout="wide",
|
11 |
+
initial_sidebar_state="auto"
|
12 |
+
)
|
13 |
+
|
14 |
+
# CSS for styling
|
15 |
+
st.markdown("""
|
16 |
+
<style>
|
17 |
+
.main-title {
|
18 |
+
font-size: 36px;
|
19 |
+
color: #4A90E2;
|
20 |
+
font-weight: bold;
|
21 |
+
text-align: center;
|
22 |
+
}
|
23 |
+
.section {
|
24 |
+
background-color: #f9f9f9;
|
25 |
+
padding: 10px;
|
26 |
+
border-radius: 10px;
|
27 |
+
margin-top: 10px;
|
28 |
+
}
|
29 |
+
.section p, .section ul {
|
30 |
+
color: #666666;
|
31 |
+
}
|
32 |
+
</style>
|
33 |
+
""", unsafe_allow_html=True)
|
34 |
+
|
35 |
+
@st.cache_resource
|
36 |
+
def init_spark():
|
37 |
+
return sparknlp.start()
|
38 |
+
|
39 |
+
@st.cache_resource
|
40 |
+
def create_pipeline(model):
|
41 |
+
document_assembler = DocumentAssembler() \
|
42 |
+
.setInputCol("text") \
|
43 |
+
.setOutputCol("documents")
|
44 |
+
|
45 |
+
t5 = T5Transformer() \
|
46 |
+
.pretrained(model) \
|
47 |
+
.setTask("question:")\
|
48 |
+
.setMaxOutputLength(200)\
|
49 |
+
.setInputCols(["documents"]) \
|
50 |
+
.setOutputCol("answers")
|
51 |
+
|
52 |
+
pipeline = Pipeline(stages=[document_assembler, t5])
|
53 |
+
return pipeline
|
54 |
+
|
55 |
+
def fit_data(pipeline, data):
|
56 |
+
df = spark.createDataFrame([[data]]).toDF("text")
|
57 |
+
result = pipeline.fit(df).transform(df)
|
58 |
+
return result.select('answers.result').collect()
|
59 |
+
|
60 |
+
# Sidebar content
|
61 |
+
model = st.sidebar.selectbox(
|
62 |
+
"Choose the pretrained model",
|
63 |
+
['t5_base', 't5_small'],
|
64 |
+
help="For more info about the models visit: https://sparknlp.org/models"
|
65 |
+
)
|
66 |
+
|
67 |
+
# Set up the page layout
|
68 |
+
title, sub_title = (
|
69 |
+
'Automatically Answer Questions (OPEN BOOK)',
|
70 |
+
'Automatically generate answers to questions without context.'
|
71 |
+
)
|
72 |
+
|
73 |
+
st.markdown(f'<div class="main-title">{title}</div>', unsafe_allow_html=True)
|
74 |
+
st.write(sub_title)
|
75 |
+
# Reference notebook link in sidebar
|
76 |
+
link = """
|
77 |
+
<a href="https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/streamlit_notebooks/public/QUESTION_ANSWERING_OPEN_BOOK.ipynb#scrollTo=SunNYkw3-Ic8">
|
78 |
+
<img src="https://colab.research.google.com/assets/colab-badge.svg" style="zoom: 1.3" alt="Open In Colab"/>
|
79 |
+
</a>
|
80 |
+
"""
|
81 |
+
st.sidebar.markdown('Reference notebook:')
|
82 |
+
st.sidebar.markdown(link, unsafe_allow_html=True)
|
83 |
+
|
84 |
+
# Define the examples as a dictionary for easier access
|
85 |
+
examples = {
|
86 |
+
"What does increased oxygen concentrations in the patient’s lungs displace?": """Hyperbaric (high-pressure) medicine uses special oxygen chambers to increase the partial pressure of O 2 around the patient and, when needed, the medical staff. Carbon monoxide poisoning, gas gangrene, and decompression sickness (the ’bends’) are sometimes treated using these devices. Increased O 2 concentration in the lungs helps to displace carbon monoxide from the heme group of hemoglobin. Oxygen gas is poisonous to the anaerobic bacteria that cause gas gangrene, so increasing its partial pressure helps kill them. Decompression sickness occurs in divers who decompress too quickly after a dive, resulting in bubbles of inert gas, mostly nitrogen and helium, forming in their blood. Increasing the pressure of O 2 as soon as possible is part of the treatment.""",
|
87 |
+
"What category of game is Legend of Zelda: Twilight Princess?": """The Legend of Zelda: Twilight Princess (Japanese: ゼルダの伝説 トワイライトプリンセス, Hepburn: Zeruda no Densetsu: Towairaito Purinsesu?) is an action-adventure game developed and published by Nintendo for the GameCube and Wii home video game consoles. It is the thirteenth installment in the The Legend of Zelda series. Originally planned for release on the GameCube in November 2005, Twilight Princess was delayed by Nintendo to allow its developers to refine the game, add more content, and port it to the Wii. The Wii version was released alongside the console in North America in November 2006, and in Japan, Europe, and Australia the following month. The GameCube version was released worldwide in December 2006.""",
|
88 |
+
"Who is founder of Alibaba Group?": """Alibaba Group founder Jack Ma has made his first appearance since Chinese regulators cracked down on his business empire. His absence had fuelled speculation over his whereabouts amid increasing official scrutiny of his businesses. The billionaire met 100 rural teachers in China via a video meeting on Wednesday, according to local government media. Alibaba shares surged 5% on Hong Kong's stock exchange on the news.""",
|
89 |
+
"For what instrument did Frédéric write primarily for?": """Frédéric François Chopin (/ˈʃoʊpæn/; French pronunciation: [fʁe.de.ʁik fʁɑ̃.swa ʃɔ.pɛ̃]; 22 February or 1 March 1810 – 17 October 1849), born Fryderyk Franciszek Chopin,[n 1] was a Polish and French (by citizenship and birth of father) composer and a virtuoso pianist of the Romantic era, who wrote primarily for the solo piano. He gained and has maintained renown worldwide as one of the leading musicians of his era, whose "poetic genius was based on a professional technique that was without equal in his generation." Chopin was born in what was then the Duchy of Warsaw, and grew up in Warsaw, which after 1815 became part of Congress Poland. A child prodigy, he completed his musical education and composed his earlier works in Warsaw before leaving Poland at the age of 20, less than a month before the outbreak of the November 1830 Uprising.""",
|
90 |
+
"The most populated city in the United States is which city?": """New York—often called New York City or the City of New York to distinguish it from the State of New York, of which it is a part—is the most populous city in the United States and the center of the New York metropolitan area, the premier gateway for legal immigration to the United States and one of the most populous urban agglomerations in the world. A global power city, New York exerts a significant impact upon commerce, finance, media, art, fashion, research, technology, education, and entertainment, its fast pace defining the term New York minute. Home to the headquarters of the United Nations, New York is an important center for international diplomacy and has been described as the cultural and financial capital of the world."""
|
91 |
+
}
|
92 |
+
|
93 |
+
# Create a select box for predefined examples
selected_text = st.selectbox('Select an Example:', list(examples.keys()))

# Add input fields for custom question and context
st.write('Try it yourself!')
custom_input_question = st.text_input('Create a question')
custom_input_context = st.text_input("Create its context")

# Prefer the user's own question/context when BOTH fields are filled;
# otherwise fall back to the predefined example chosen in the dropdown.
# (The original code also built an unused one-entry dict here; removed.)
if custom_input_question and custom_input_context:
    selected_example_key = custom_input_question
    selected_example_context = custom_input_context
else:
    selected_example_key = selected_text
    selected_example_context = examples[selected_text]

# Prepare the final prompt sent to the T5 pipeline: the question followed by
# its context, each wrapped in triple quotes (format expected by the task).
selected_text = f'"""{selected_example_key}""" """context : {selected_example_context}"""'

st.markdown('---')
# Display the selected or custom example
st.markdown(f"**Text:** {selected_example_key}")
st.markdown(f"**Context:** {selected_example_context}")
|
118 |
+
|
119 |
+
|
120 |
+
# Initialize Spark and create pipeline
|
121 |
+
spark = init_spark()
|
122 |
+
pipeline = create_pipeline(model)
|
123 |
+
output = fit_data(pipeline, selected_text)
|
124 |
+
|
125 |
+
# Display matched sentence
|
126 |
+
output_text = "".join(output[0][0])
|
127 |
+
st.markdown('---')
|
128 |
+
st.markdown(f"Answer: **{output_text.title()}**")
|
129 |
+
|
Dockerfile
ADDED
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Download base image ubuntu 18.04
|
2 |
+
FROM ubuntu:18.04
|
3 |
+
|
4 |
+
# Set environment variables
|
5 |
+
ENV NB_USER jovyan
|
6 |
+
ENV NB_UID 1000
|
7 |
+
ENV HOME /home/${NB_USER}
|
8 |
+
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
|
9 |
+
|
10 |
+
# Install required packages
|
11 |
+
RUN apt-get update && apt-get install -y \
|
12 |
+
tar \
|
13 |
+
wget \
|
14 |
+
bash \
|
15 |
+
rsync \
|
16 |
+
gcc \
|
17 |
+
libfreetype6-dev \
|
18 |
+
libhdf5-serial-dev \
|
19 |
+
libpng-dev \
|
20 |
+
libzmq3-dev \
|
21 |
+
python3 \
|
22 |
+
python3-dev \
|
23 |
+
python3-pip \
|
24 |
+
unzip \
|
25 |
+
pkg-config \
|
26 |
+
software-properties-common \
|
27 |
+
graphviz \
|
28 |
+
openjdk-8-jdk \
|
29 |
+
ant \
|
30 |
+
ca-certificates-java \
|
31 |
+
&& apt-get clean \
|
32 |
+
&& update-ca-certificates -f
|
33 |
+
|
34 |
+
# Install Python 3.8 and pip
|
35 |
+
RUN add-apt-repository ppa:deadsnakes/ppa \
|
36 |
+
&& apt-get update \
|
37 |
+
&& apt-get install -y python3.8 python3-pip \
|
38 |
+
&& apt-get clean
|
39 |
+
|
40 |
+
# Set up JAVA_HOME
|
41 |
+
RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/" >> /etc/profile \
|
42 |
+
&& echo "export PATH=\$JAVA_HOME/bin:\$PATH" >> /etc/profile
|
43 |
+
# Create a new user named "jovyan" with user ID 1000
|
44 |
+
RUN useradd -m -u ${NB_UID} ${NB_USER}
|
45 |
+
|
46 |
+
# Switch to the "jovyan" user
|
47 |
+
USER ${NB_USER}
|
48 |
+
|
49 |
+
# Set home and path variables for the user
|
50 |
+
ENV HOME=/home/${NB_USER} \
|
51 |
+
PATH=/home/${NB_USER}/.local/bin:$PATH
|
52 |
+
|
53 |
+
# Set up PySpark to use Python 3.8 for both driver and workers
|
54 |
+
ENV PYSPARK_PYTHON=/usr/bin/python3.8
|
55 |
+
ENV PYSPARK_DRIVER_PYTHON=/usr/bin/python3.8
|
56 |
+
|
57 |
+
# Set the working directory to the user's home directory
|
58 |
+
WORKDIR ${HOME}
|
59 |
+
|
60 |
+
# Upgrade pip and install Python dependencies
|
61 |
+
RUN python3.8 -m pip install --upgrade pip
|
62 |
+
COPY requirements.txt /tmp/requirements.txt
|
63 |
+
RUN python3.8 -m pip install -r /tmp/requirements.txt
|
64 |
+
|
65 |
+
# Copy the application code into the container at /home/jovyan
|
66 |
+
COPY --chown=${NB_USER}:${NB_USER} . ${HOME}
|
67 |
+
|
68 |
+
# Expose port for Streamlit
|
69 |
+
EXPOSE 7860
|
70 |
+
|
71 |
+
# Define the entry point for the container
|
72 |
+
ENTRYPOINT ["streamlit", "run", "Demo.py", "--server.port=7860", "--server.address=0.0.0.0"]
|
pages/Workflow & Model Overview.py
ADDED
@@ -0,0 +1,180 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import streamlit as st
|
2 |
+
|
3 |
+
# Page configuration
|
4 |
+
st.set_page_config(
|
5 |
+
layout="wide",
|
6 |
+
initial_sidebar_state="auto"
|
7 |
+
)
|
8 |
+
|
9 |
+
# Custom CSS for better styling
|
10 |
+
st.markdown("""
|
11 |
+
<style>
|
12 |
+
.main-title {
|
13 |
+
font-size: 36px;
|
14 |
+
color: #4A90E2;
|
15 |
+
font-weight: bold;
|
16 |
+
text-align: center;
|
17 |
+
}
|
18 |
+
.sub-title {
|
19 |
+
font-size: 24px;
|
20 |
+
color: #4A90E2;
|
21 |
+
margin-top: 20px;
|
22 |
+
}
|
23 |
+
.section {
|
24 |
+
background-color: #f9f9f9;
|
25 |
+
padding: 15px;
|
26 |
+
border-radius: 10px;
|
27 |
+
margin-top: 20px;
|
28 |
+
}
|
29 |
+
.section h2 {
|
30 |
+
font-size: 22px;
|
31 |
+
color: #4A90E2;
|
32 |
+
}
|
33 |
+
.section p, .section ul {
|
34 |
+
color: #666666;
|
35 |
+
}
|
36 |
+
.link {
|
37 |
+
color: #4A90E2;
|
38 |
+
text-decoration: none;
|
39 |
+
}
|
40 |
+
</style>
|
41 |
+
""", unsafe_allow_html=True)
|
42 |
+
|
43 |
+
# Title
|
44 |
+
st.markdown('<div class="main-title">Automatically Answer Questions (OPEN BOOK)</div>', unsafe_allow_html=True)
|
45 |
+
|
46 |
+
# Introduction Section
|
47 |
+
st.markdown("""
|
48 |
+
<div class="section">
|
49 |
+
<p>Open-book question answering is a task where a model generates answers based on provided text or documents. Unlike closed-book models, open-book models utilize external sources to produce responses, making them more accurate and versatile in scenarios where the input text provides essential context.</p>
|
50 |
+
<p>This page explores how to implement an open-book question-answering pipeline using state-of-the-art NLP techniques. We use a T5 Transformer model, which is well-suited for generating detailed answers by leveraging the information contained within the input text.</p>
|
51 |
+
</div>
|
52 |
+
""", unsafe_allow_html=True)
|
53 |
+
|
54 |
+
# T5 Transformer Overview
|
55 |
+
st.markdown('<div class="sub-title">Understanding the T5 Transformer for Open-Book QA</div>', unsafe_allow_html=True)
|
56 |
+
|
57 |
+
st.markdown("""
|
58 |
+
<div class="section">
|
59 |
+
<p>The T5 (Text-To-Text Transfer Transformer) model by Google excels in converting various NLP tasks into a unified text-to-text format. For open-book question answering, the model takes a question and relevant context as input, generating a detailed and contextually appropriate answer.</p>
|
60 |
+
<p>The T5 model's ability to utilize provided documents makes it especially powerful in applications where the accuracy of the response is enhanced by access to supporting information, such as research tools, educational applications, or any system where the input text contains critical data.</p>
|
61 |
+
</div>
|
62 |
+
""", unsafe_allow_html=True)
|
63 |
+
|
64 |
+
# Performance Section
|
65 |
+
st.markdown('<div class="sub-title">Performance and Benchmarks</div>', unsafe_allow_html=True)
|
66 |
+
|
67 |
+
st.markdown("""
|
68 |
+
<div class="section">
|
69 |
+
<p>In open-book settings, the T5 model has been benchmarked across various datasets, demonstrating its capability to generate accurate and comprehensive answers when given relevant context. Its performance has been particularly strong in tasks requiring a deep understanding of the input text to produce correct and context-aware responses.</p>
|
70 |
+
<p>Open-book T5 models are especially valuable in applications that require dynamic interaction with content, making them ideal for domains such as customer support, research, and educational technologies.</p>
|
71 |
+
</div>
|
72 |
+
""", unsafe_allow_html=True)
|
73 |
+
|
74 |
+
# Implementation Section
|
75 |
+
st.markdown('<div class="sub-title">Implementing Open-Book Question Answering</div>', unsafe_allow_html=True)
|
76 |
+
|
77 |
+
st.markdown("""
|
78 |
+
<div class="section">
|
79 |
+
<p>The following example demonstrates how to implement an open-book question answering pipeline using Spark NLP. The pipeline includes a document assembler and the T5 model to generate answers based on the input text.</p>
|
80 |
+
</div>
|
81 |
+
""", unsafe_allow_html=True)
|
82 |
+
|
83 |
+
st.code('''
|
84 |
+
from sparknlp.base import *
|
85 |
+
from sparknlp.annotator import *
|
86 |
+
from pyspark.ml import Pipeline
|
87 |
+
from pyspark.sql.functions import col, expr
|
88 |
+
|
89 |
+
document_assembler = DocumentAssembler()\\
|
90 |
+
.setInputCol("text")\\
|
91 |
+
.setOutputCol("documents")
|
92 |
+
|
93 |
+
t5 = T5Transformer()\\
|
94 |
+
.pretrained(model_name)\\
|
95 |
+
.setTask("question:")\\
|
96 |
+
.setMaxOutputLength(200)\\
|
97 |
+
.setInputCols(["documents"])\\
|
98 |
+
.setOutputCol("answers")
|
99 |
+
|
100 |
+
pipeline = Pipeline().setStages([document_assembler, t5])
|
101 |
+
|
102 |
+
data = spark.createDataFrame([["What is the impact of climate change on polar bears?"]]).toDF("text")
|
103 |
+
result = pipeline.fit(data).transform(data)
|
104 |
+
result.select("answers.result").show(truncate=False)
|
105 |
+
''', language='python')
|
106 |
+
|
107 |
+
# Example Output
|
108 |
+
st.text("""
|
109 |
+
+------------------------------------------------+
|
110 |
+
|answers.result |
|
111 |
+
+------------------------------------------------+
|
112 |
+
|Climate change significantly affects polar ... |
|
113 |
+
+------------------------------------------------+
|
114 |
+
""")
|
115 |
+
|
116 |
+
# Model Info Section
|
117 |
+
st.markdown('<div class="sub-title">Choosing the Right Model for Open-Book QA</div>', unsafe_allow_html=True)
|
118 |
+
|
119 |
+
st.markdown("""
|
120 |
+
<div class="section">
|
121 |
+
<p>When selecting a model for open-book question answering, it's important to consider the specific needs of your application. Below are some of the available models, each offering different strengths based on their transformer architecture:</p>
|
122 |
+
<ul>
|
123 |
+
<li><b>t5_base</b>: A versatile model that provides strong performance on question-answering tasks, ideal for applications requiring detailed answers.</li>
|
124 |
+
<li><b>t5_small</b>: A more lightweight variant of T5, suitable for applications where resource efficiency is crucial, though it may not be as accurate as larger models.</li>
|
125 |
+
<li><b>albert_qa_xxlarge_tweetqa</b>: Based on the ALBERT architecture, this model is fine-tuned for the TweetQA dataset, making it effective for answering questions in shorter text formats.</li>
|
126 |
+
<li><b>bert_qa_callmenicky_finetuned_squad</b>: A fine-tuned BERT model that offers a good balance between accuracy and computational efficiency, suitable for general-purpose QA tasks.</li>
|
127 |
+
<li><b>deberta_v3_xsmall_qa_squad2</b>: A smaller DeBERTa model, optimized for high accuracy on SQuAD2 while being resource-efficient, making it great for smaller deployments.</li>
|
128 |
+
<li><b>distilbert_base_cased_qa_squad2</b>: A distilled version of BERT, offering faster inference times with slightly reduced accuracy, suitable for environments with limited resources.</li>
|
129 |
+
<li><b>longformer_qa_large_4096_finetuned_triviaqa</b>: This model is particularly well-suited for open-book QA tasks involving long documents, as it can handle extended contexts effectively.</li>
|
130 |
+
<li><b>roberta_qa_roberta_base_squad2_covid</b>: A RoBERTa-based model fine-tuned for COVID-related QA, making it highly specialized for health-related domains.</li>
|
131 |
+
<li><b>roberta_qa_CV_Merge_DS</b>: Another RoBERTa model, fine-tuned on a diverse dataset, offering versatility across different domains and question types.</li>
|
132 |
+
<li><b>xlm_roberta_base_qa_squad2</b>: A multilingual model fine-tuned on SQuAD2, ideal for QA tasks across various languages.</li>
|
133 |
+
</ul>
|
134 |
+
<p>Among these models, <b>t5_base</b> and <b>longformer_qa_large_4096_finetuned_triviaqa</b> are highly recommended for their strong performance in generating accurate and contextually rich answers, especially in scenarios with long input texts. For faster responses with an emphasis on efficiency, <b>distilbert_base_cased_qa_squad2</b> and <b>deberta_v3_xsmall_qa_squad2</b> are excellent choices. Specialized tasks may benefit from models like <b>albert_qa_xxlarge_tweetqa</b> or <b>roberta_qa_roberta_base_squad2_covid</b>, depending on the domain.</p>
|
135 |
+
<p>Explore the available models on the <a class="link" href="https://sparknlp.org/models?annotator=T5Transformer" target="_blank">Spark NLP Models Hub</a> to find the one that best suits your needs.</p>
|
136 |
+
</div>
|
137 |
+
""", unsafe_allow_html=True)
|
138 |
+
|
139 |
+
|
140 |
+
# Footer
|
141 |
+
# References Section
|
142 |
+
st.markdown('<div class="sub-title">References</div>', unsafe_allow_html=True)
|
143 |
+
|
144 |
+
st.markdown("""
|
145 |
+
<div class="section">
|
146 |
+
<ul>
|
147 |
+
<li><a class="link" href="https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html" target="_blank">Google AI Blog</a>: Exploring Transfer Learning with T5</li>
|
148 |
+
<li><a class="link" href="https://sparknlp.org/models?annotator=T5Transformer" target="_blank">Spark NLP Model Hub</a>: Explore T5 models</li>
|
149 |
+
<li><a class="link" href="https://github.com/google-research/text-to-text-transfer-transformer" target="_blank">GitHub</a>: T5 Transformer repository</li>
|
150 |
+
<li><a class="link" href="https://arxiv.org/abs/1910.10683" target="_blank">T5 Paper</a>: Detailed insights from the developers</li>
|
151 |
+
</ul>
|
152 |
+
</div>
|
153 |
+
""", unsafe_allow_html=True)
|
154 |
+
|
155 |
+
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
|
156 |
+
|
157 |
+
st.markdown("""
|
158 |
+
<div class="section">
|
159 |
+
<ul>
|
160 |
+
<li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
|
161 |
+
<li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
|
162 |
+
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
|
163 |
+
<li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Spark NLP articles</li>
|
164 |
+
<li><a class="link" href="https://www.youtube.com/channel/UCmFOjlpYEhxf_wJUDuz6xxQ/videos" target="_blank">YouTube</a>: Video tutorials</li>
|
165 |
+
</ul>
|
166 |
+
</div>
|
167 |
+
""", unsafe_allow_html=True)
|
168 |
+
|
169 |
+
st.markdown('<div class="sub-title">Quick Links</div>', unsafe_allow_html=True)
|
170 |
+
|
171 |
+
st.markdown("""
|
172 |
+
<div class="section">
|
173 |
+
<ul>
|
174 |
+
<li><a class="link" href="https://sparknlp.org/docs/en/quickstart" target="_blank">Getting Started</a></li>
|
175 |
+
<li><a class="link" href="https://nlp.johnsnowlabs.com/models" target="_blank">Pretrained Models</a></li>
|
176 |
+
<li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/annotation/text/english" target="_blank">Example Notebooks</a></li>
|
177 |
+
<li><a class="link" href="https://sparknlp.org/docs/en/install" target="_blank">Installation Guide</a></li>
|
178 |
+
</ul>
|
179 |
+
</div>
|
180 |
+
""", unsafe_allow_html=True)
|
requirements.txt
ADDED
@@ -0,0 +1,7 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
streamlit
|
2 |
+
st-annotated-text
|
3 |
+
streamlit-tags
|
4 |
+
pandas
|
5 |
+
numpy
|
6 |
+
spark-nlp
|
7 |
+
pyspark
|