## Setting Up

In [None]:
%%capture

%pip install langchain langchain-community
%pip install langchainhub
%pip install langchain-chroma
%pip install langchain-groq
%pip install langchain-huggingface
%pip install gradio

In [None]:
from google.colab import userdata

groq_api_key = userdata.get('GROQ_API_KEY')
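
Outside Colab, `google.colab` cannot be imported. Below is a minimal sketch of a fallback, assuming the key is also exported as a `GROQ_API_KEY` environment variable. The helper name `load_groq_api_key` is hypothetical and not part of any library.

```python
import os


def load_groq_api_key(name: str = "GROQ_API_KEY") -> str:
    """Return the API key from Colab secrets, else from the environment."""
    try:
        # Only importable inside a Google Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Assumption: outside Colab the key is exported as an environment variable.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"{name} is not set")
        return key


# Usage (commented out so the cell has no side effects):
# groq_api_key = load_groq_api_key()
```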

In [None]:
!unzip /content/archive.zip -d /content/KaggleX_Starwars

Archive:  /content/archive.zip
  inflating: /content/KaggleX_Starwars/csv/battles.csv  
  inflating: /content/KaggleX_Starwars/csv/characters.csv  
  inflating: /content/KaggleX_Starwars/csv/cities.csv  
  inflating: /content/KaggleX_Starwars/csv/droids.csv  
  inflating: /content/KaggleX_Starwars/csv/events.csv  
  inflating: /content/KaggleX_Starwars/csv/films.csv  
  inflating: /content/KaggleX_Starwars/csv/music.csv  
  inflating: /content/KaggleX_Starwars/csv/organizations.csv  
  inflating: /content/KaggleX_Starwars/csv/planets.csv  
  inflating: /content/KaggleX_Starwars/csv/quotes.csv  
  inflating: /content/KaggleX_Starwars/csv/species.csv  
  inflating: /content/KaggleX_Starwars/csv/starships.csv  
  inflating: /content/KaggleX_Starwars/csv/timeline.csv  
  inflating: /content/KaggleX_Starwars/csv/vehicles.csv  
  inflating: /content/KaggleX_Starwars/csv/weapons.csv  
  inflating: /content/KaggleX_Starwars/parquet_files/battles.parquet  
  inflating: /content/KaggleX_Starwars

## Groq Python API

In [None]:
from groq import Groq

client = Groq(
   api_key=groq_api_key,
)


chat_streaming = client.chat.completions.create(
    messages=[
       {"role": "system", "content": "You are a professional Data Engineer."},
       {"role": "user", "content": "Can you explain how the data lake works?"},
    ],
    model="llama-3.1-8b-instant",
    temperature=0.3,
    max_tokens=1200,
    top_p=1,
    stop=None,
    stream=True,
)

for chunk in chat_streaming:
    # A chunk may carry content=None (typically the last one); print "" instead of "None"
    print(chunk.choices[0].delta.content or "", end="")

A data lake is a centralized repository that stores raw, unprocessed data in its native format, allowing for easy access, processing, and analysis. Here's a breakdown of how a data lake works:

**Key Components:**

1. **Data Ingestion**: Data is collected from various sources, such as databases, APIs, files, and IoT devices. This data is then ingested into the data lake using tools like Apache NiFi, Apache Flume, or AWS Kinesis.
2. **Data Storage**: The ingested data is stored in a scalable and cost-effective storage system, such as Hadoop Distributed File System (HDFS), Amazon S3, or Azure Data Lake Storage (ADLS).
3. **Data Processing**: Data is processed using various tools and frameworks, such as Apache Spark, Apache Flink, or AWS Glue, to transform, aggregate, and analyze the data.
4. **Data Governance**: Data governance ensures that data is properly managed, secured, and compliant with regulations. This includes data quality, metadata management, and access control.

**Data Lake 
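
To keep the full answer rather than only print it, the streamed deltas can be collected into one string. The sketch below uses stand-in chunk objects instead of a live Groq call; `collect_stream` and `fake_chunk` are hypothetical helper names, and the only assumption about the API is the `chunk.choices[0].delta.content` shape used in the cell above.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Join the text deltas of a chat-completion stream, ignoring empty ones."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # skips None and empty strings
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    # Stand-in that mimics the chunk.choices[0].delta.content shape.
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


stream = [fake_chunk("A data "), fake_chunk("lake"), fake_chunk(None)]
full_text = collect_stream(stream)
```

With a real stream, `collect_stream(chat_streaming)` would consume the iterator once, so it replaces the printing loop rather than running after it.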

## Initiating LLM

In [None]:
from langchain_groq import ChatGroq

llm = ChatGroq(model="llama-3.1-70b-versatile", api_key=groq_api_key)

## Initiating Embedding Model

In [None]:
from langchain_huggingface import HuggingFaceEmbeddings
embed_model = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")

## Loading the CSV files
The dataset comes from Kaggle: https://www.kaggle.com/datasets/jsphyg/star-wars

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.document_loaders import DirectoryLoader
# Point the loader at the csv folder created by the unzip step above
loader = DirectoryLoader("/content/KaggleX_Starwars/csv", glob="**/*.csv", loader_cls=CSVLoader)

data = loader.load()

In [None]:
len(data)

488
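
Each loaded document corresponds to one CSV row, rendered as `column: value` lines. Here is a rough stdlib-only sketch of that transformation. It is not the `CSVLoader` implementation itself; the sample row mirrors the Battle of Endor record that appears in this notebook's search output, and `row_to_text` is a hypothetical helper name.

```python
import csv
import io

# Sample data mirroring one battles.csv row; the header names are taken
# from the "column: value" lines in the notebook output.
sample_csv = io.StringIO(
    "id,name,location,date,result,participants\n"
    '3,Battle of Endor,Endor,4 ABY,Rebel Victory,"Rebel Alliance, Galactic Empire"\n'
)


def row_to_text(row: dict) -> str:
    """Render one CSV row as newline-separated 'column: value' pairs."""
    return "\n".join(f"{column}: {value}" for column, value in row.items())


# One text document per row, in file order.
documents = [row_to_text(row) for row in csv.DictReader(sample_csv)]
print(documents[0])
```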

## Setting up VectorStore

In [None]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=data,
    embedding=embed_model,
    persist_directory="/content/Starwars_Vectordb",
)


In [None]:
query = "Which battle resulted in Rebel Victory?"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)

id: 3
name: Battle of Endor
location: Endor
date: 4 ABY
result: Rebel Victory
participants: Rebel Alliance, Galactic Empire
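
`similarity_search` embeds the query and returns the stored documents whose vectors lie closest to it. The toy illustration below uses hand-made 3-dimensional vectors and cosine similarity. Both the vectors and the metric are assumptions for illustration; the real store works on the embedding model's vectors and its own configured distance.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, not produced by any real model.
index = {
    "Battle of Endor": [0.9, 0.1, 0.0],
    "Millennium Falcon": [0.1, 0.9, 0.2],
    "Tatooine": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]


def top_k(query, index, k=2):
    """Return the k entry names most similar to the query vector."""
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:k]


best = top_k(query_vec, index)
```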


## Building Components for RAG Chain

In [None]:
retriever = vectorstore.as_retriever()

In [None]:
from langchain_core.prompts import PromptTemplate

template = ("""You are a Star Wars assistant for answering questions.
    Use the provided context to answer the question.
    If you don't know the answer, say so. Explain your answer in detail.
    Do not discuss the context in your response; just provide the answer directly.

    Context: {context}

    Question: {question}

    Answer:""")

rag_prompt = PromptTemplate.from_template(template)

## Building the RAG Chain

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
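
In this chain the retriever's output (a list of `Document` objects) is passed to `{context}` as-is, so the prompt receives the list's string representation. A common alternative is to join the `page_content` fields first. Below is a minimal sketch using a stand-in `Doc` class instead of LangChain's `Document`; `format_docs` is a hypothetical helper name and the sample texts are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def format_docs(docs) -> str:
    """Join retrieved documents into one context string, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


# Illustrative documents, shaped like the CSV rows loaded earlier.
docs = [Doc("id: 3\nname: Battle of Endor"), Doc("name: Battle of Yavin")]
context = format_docs(docs)
```

In the chain, this would plug in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since a plain function on the right of `|` is coerced into a runnable.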

## Testing the Chain

In [None]:
from IPython.display import display, Markdown

response = rag_chain.invoke("Which battle resulted in Rebel Victory?")
Markdown(response)

The Battle of Endor and the Battle of Yavin resulted in Rebel Victory.

In [None]:
query = "What is the timeline of the Starwars?"

for chunk in rag_chain.stream(query):
    print(chunk, end="")

Based on the provided documents, the timeline of the Star Wars cannot be determined with certainty. However, there are two relevant pieces of information:

1. The Battle of Endor occurred 4 years after the Battle of Yavin (presumably the Battle of Yavin occurred at the time of the events in Episode IV: A New Hope, as Episode IV: A New Hope is the fourth installment in the original trilogy, and the title of the document with the Battle of Endor is "id: 3\nevent: Battle of Endor\nyear: 4 ABY").
2. Based on the release dates of the films, the order of the films is as follows: Episode I: The Phantom Menace (1999), Episode II: (not provided in the documents), Episode III: (not provided in the documents), Episode IV: A New Hope (1977), Episode V: The Empire Strikes Back (1980), Episode VI: (not provided in the documents), Episode VII: (not provided in the documents), Episode VIII: (not provided in the documents), Episode IX: (not provided in the documents), Episode X: (not provided in the do

## Gradio App

In [None]:
import gradio as gr

def rag_memory_stream(text):
    partial_text = ""
    for new_text in rag_chain.stream(text):
        partial_text += new_text
        # Yield the accumulated text so the output box updates as new tokens arrive
        yield partial_text


title = "Real-time AI App with Groq API and LangChain"
demo = gr.Interface(
    title=title,
    fn=rag_memory_stream,
    inputs="text",
    outputs="text",
    live=True,
    concurrency_limit=16
)

demo.queue()
demo.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://f03fd9f0388410af8e.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


