import streamlit as st
import pandas as pd
# Custom CSS for Styling
st.markdown("""
<style>
    .main-title {
        font-size: 36px;
        color: #4A90E2;
        font-weight: bold;
        text-align: center;
    }
    .sub-title {
        font-size: 24px;
        color: #4A90E2;
        margin-top: 20px;
    }
    .section {
        background-color: #f9f9f9;
        padding: 15px;
        border-radius: 10px;
        margin-top: 20px;
    }
    .section p, .section ul {
        color: #666666;
    }
    .link {
        color: #4A90E2;
        text-decoration: none;
    }
    h2 {
        color: #4A90E2;
        font-size: 28px;
        margin-top: 30px;
    }
    h3 {
        color: #4A90E2;
        font-size: 22px;
        margin-top: 20px;
    }
    h4 {
        color: #4A90E2;
        font-size: 18px;
        margin-top: 15px;
    }
</style>
""", unsafe_allow_html=True)
# Main Title
st.markdown('<div class="main-title">Multilingual Text Translation with Spark NLP and MarianMT</div>', unsafe_allow_html=True)
# Overview Section
st.markdown("""
<div class="section">
    <p>As more applications need to bridge language barriers, multilingual text translation has become increasingly important. MarianMT models are fast neural machine translation models built on the Transformer architecture, and together they cover over 1,000 translation directions. This guide demonstrates how to use MarianMT within Spark NLP to translate text across multiple languages.</p>
</div>
""", unsafe_allow_html=True)
# Introduction to MarianMT and Spark NLP
st.markdown('<div class="sub-title">What is MarianMT?</div>', unsafe_allow_html=True)
# What is MarianMT?
st.markdown("""
<div class="section">
    <p>MarianMT refers to translation models trained with Marian, a neural machine translation framework written in C++ and developed primarily by the Microsoft Translator team. Marian is known for fast training and translation, and it is used in both industrial and research applications. The models used in this guide come from the OPUS-MT project of Helsinki-NLP.</p>
</div>
""", unsafe_allow_html=True)
# Pipeline and Results
st.markdown('<div class="sub-title">Pipeline and Results</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <p>In this section, we build a Spark NLP pipeline that uses a MarianMT model to translate English text into Chinese, covering each step from data preparation to the final output.</p>
</div>
""", unsafe_allow_html=True)
# Step 1: Creating the Data
st.markdown("""
<div class="section">
    <h4>Step 1: Creating the Data</h4>
    <p>We begin by starting a Spark session and creating a Spark DataFrame with a single <code>text</code> column that holds the English text to translate into Chinese.</p>
</div>
""", unsafe_allow_html=True)
st.code("""
import sparknlp

spark = sparknlp.start()  # Spark session with Spark NLP loaded
data = [["Hello, how are you?"]]
df = spark.createDataFrame(data).toDF("text")
""", language="python")
# Step 2: Assembling the Pipeline
st.markdown("""
<div class="section">
    <h4>Step 2: Assembling the Pipeline</h4>
    <p>Next, we set up a Spark NLP pipeline with three stages: a document assembler, a sentence detector, and the MarianMT model that performs the translation.</p>
</div>
""", unsafe_allow_html=True)
st.code("""
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, MarianTransformer
from pyspark.ml import Pipeline

# Turn the raw text column into Spark NLP documents
document_assembler = DocumentAssembler()\\
    .setInputCol("text")\\
    .setOutputCol("document")

# Split each document into sentences with the multilingual detector
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\\
    .setInputCols(["document"])\\
    .setOutputCol("sentences")

# Translate each sentence from English to Chinese
marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx")\\
    .setInputCols(["sentences"])\\
    .setOutputCol("translation")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, marian])
model = pipeline.fit(df)
result = model.transform(df)
""", language="python")
# Step 3: Viewing the Results
st.markdown("""
<div class="section">
    <h4>Step 3: Viewing the Results</h4>
    <p>After the pipeline has processed the text, we can view the translations generated by the MarianMT model. Each row of <code>translation.result</code> holds one translated string per detected sentence:</p>
</div>
""", unsafe_allow_html=True)
st.code("""
result.select("translation.result").show(truncate=False)
""", language="python")
st.text("""
+--------------+
|result        |
+--------------+
|[你好,你好吗?]|
+--------------+
""")
# Model Information and Use Cases
st.markdown("""
<div class="section">
    <h4>Model Information and Use Cases</h4>
    <p>Spark NLP provides MarianMT models for many translation directions. The model used in this guide has the following characteristics:</p>
    <ul>
        <li><b>Model Name:</b> opus_mt_en_zh</li>
        <li><b>Input Language:</b> English (en)</li>
        <li><b>Output Language:</b> Chinese (zh)</li>
        <li><b>Best for:</b> General text translation from English to Chinese.</li>
        <li><b>Compatibility:</b> Spark NLP 2.7.0+</li>
    </ul>
</div>
""", unsafe_allow_html=True)
# Conclusion
st.markdown("""
<div class="section">
    <h4>Conclusion</h4>
    <p>By integrating MarianMT with Spark NLP, you can translate text across many languages while relying on Spark's distributed computing to scale. The example here translates English text to Chinese with the <code>opus_mt_en_zh</code> model, and the same pipeline applies whether you are working with a handful of sentences or a massive dataset.</p>
</div>
""", unsafe_allow_html=True)
# References
st.markdown("""
<div class="section">
    <h4>References</h4>
    <ul>
        <li>Model Documentation: <a class="link" href="https://sparknlp.org/models" target="_blank">Spark NLP Models</a></li>
        <li>MarianMT Information: <a class="link" href="https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models" target="_blank">OPUS-MT Models</a></li>
        <li>John Snow Labs: <a class="link" href="https://nlp.johnsnowlabs.com/" target="_blank">Spark NLP Documentation</a></li>
    </ul>
</div>
""", unsafe_allow_html=True)
# Community & Support
st.markdown('<div class="sub-title">Community & Support</div>', unsafe_allow_html=True)
st.markdown("""
<div class="section">
    <ul>
        <li><a class="link" href="https://sparknlp.org/" target="_blank">Official Website</a>: Documentation and examples</li>
        <li><a class="link" href="https://join.slack.com/t/spark-nlp/shared_invite/zt-198dipu77-L3UWNe_AJ8xqDk0ivmih5Q" target="_blank">Slack</a>: Live discussion with the community and team</li>
        <li><a class="link" href="https://github.com/JohnSnowLabs/spark-nlp" target="_blank">GitHub</a>: Bug reports, feature requests, and contributions</li>
        <li><a class="link" href="https://medium.com/spark-nlp" target="_blank">Medium</a>: Tutorials and articles</li>
    </ul>
</div>
""", unsafe_allow_html=True)