import streamlit as st
import pandas as pd

# Custom CSS for Styling
st.markdown("""
""", unsafe_allow_html=True)

# Main Title and Introduction
st.markdown("""
As the need to bridge language barriers in a globalized world keeps growing, multilingual text translation has become more important than ever. MarianMT, a fast and efficient neural machine translation framework built on the Transformer architecture, supports over 1,000 translation directions. This guide demonstrates how to use MarianMT within Spark NLP to produce high-quality translations across multiple languages.

MarianMT is a neural machine translation framework developed under the Marian project and backed primarily by Microsoft Translator. Implemented in C++, it translates text between numerous languages with remarkable speed and accuracy, and it is used in a wide range of industrial and research applications.

In this section, we will build a Spark NLP pipeline that uses the MarianMT model to translate English text into Chinese. We'll demonstrate the translation process from data preparation to the final output.

We'll begin by creating a Spark DataFrame containing the English text that we want to translate into Chinese.
""", unsafe_allow_html=True)

# Step 1: Preparing the Data
st.code("""
import sparknlp

spark = sparknlp.start()  # start a Spark session with Spark NLP, if one is not already running

data = [["Hello, how are you?"]]
df = spark.createDataFrame(data).toDF("text")
""", language="python")

# Step 2: Assembling the Pipeline
st.markdown("""
We will now set up a Spark NLP pipeline that includes a document assembler, a sentence detector, and the MarianMT model for translation.
""", unsafe_allow_html=True)

st.code("""
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Convert the raw text column into Spark NLP document annotations
document_assembler = DocumentAssembler() \\
    .setInputCol("text") \\
    .setOutputCol("document")

# Split each document into sentences with the multilingual ("xx") sentence detector
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \\
    .setInputCols(["document"]) \\
    .setOutputCol("sentences")

# Translate the English sentences into Chinese with the pretrained MarianMT model
marian = MarianTransformer.pretrained("opus_mt_en_zh", "xx") \\
    .setInputCols(["sentences"]) \\
    .setOutputCol("translation")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, marian])
model = pipeline.fit(df)
result = model.transform(df)
""", language="python")

# Step 3: Viewing the Results
st.markdown("""
After processing the text, we can view the translations generated by the MarianMT model:
""", unsafe_allow_html=True)

st.code("""
result.select("translation.result").show(truncate=False)
""", language="python")

st.text("""
+--------------+
|result        |
+--------------+
|[你好,你好吗?] |
+--------------+
""")
# Model Information and Use Cases
st.markdown("""
The MarianMT model is highly versatile, supporting numerous translation directions. Here’s a brief overview of its characteristics:

- Built on the Transformer architecture, with the underlying Marian engine implemented in C++ for speed and efficiency.
- Supports over 1,000 translation directions across a wide range of languages.
- Available in Spark NLP as pretrained models, such as `opus_mt_en_zh` for English-to-Chinese translation.
- Used in a variety of industrial and research applications.

By integrating MarianMT with Spark NLP, you can easily perform high-quality translations across many languages while leveraging the power of distributed computing. The example provided here demonstrates how to translate English text to Chinese using the `opus_mt_en_zh` model. Whether you’re working with small-scale text or massive datasets, this approach offers scalability and flexibility.
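
As a final tip, you don't have to build a DataFrame just to try the model on a few strings: Spark NLP's `LightPipeline` can wrap the fitted pipeline for quick, in-memory translation. The snippet below is a minimal sketch that reuses the `model` fitted above; the variable names and the input sentence are only illustrative.

```python
from sparknlp.base import LightPipeline

light_model = LightPipeline(model)

# annotate() returns a dictionary keyed by output column, e.g. {"translation": [...]}
annotations = light_model.annotate("How is the weather today?")
print(annotations["translation"])
```

The same pipeline can also target other language pairs: swapping `opus_mt_en_zh` for another `opus_mt_*` model from the Spark NLP Models Hub (for example, an English-to-French model, if available) changes the translation direction without any other code changes.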