boldhasnain committed
Commit 48df644 · verified · 1 Parent(s): 705152a

Upload 11 files
Dockerfile ADDED
@@ -0,0 +1,18 @@
+ FROM python:3.9
+
+ LABEL maintainer="pranavrao25"
+
+ WORKDIR /app
+
+ COPY requirements.txt .
+
+ RUN apt-get update \
+     && apt-get -y install tesseract-ocr
+
+ RUN pip install -r requirements.txt
+
+ COPY . .
+
+ EXPOSE 8501
+
+ CMD ["streamlit", "run", "landing_page.py"]
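For a local build, the usual Docker workflow applies; a minimal sketch (the `ragimage` tag is illustrative), with the app then reachable on the exposed port 8501:

```bash
# Build the image from this Dockerfile, then run it with the Streamlit port mapped
docker build -t ragimage .
docker run -p 8501:8501 ragimage
```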
README.md CHANGED
@@ -1,13 +1,19 @@
- ---
- title: My Multi Modal App
- emoji: 📚
- colorFrom: pink
- colorTo: indigo
- sdk: streamlit
- sdk_version: 1.38.0
- app_file: app.py
- pinned: false
- license: mit
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Multi-modal RAG based LLM for Information Retrieval
+
+ In this project we have set up a RAG system with the following features:
+ <ol>
+ <li>Custom PDF input</li>
+ <li>Multi-modal interface with support for images & text</li>
+ <li>Feedback recording and reuse</li>
+ <li>Use of Agents for Context Retrieval</li>
+ </ol>
+
+ The project primarily runs on Streamlit.<br>
+ Here is the [Docker Image](https://hub.docker.com/repository/docker/pranavrao25/ragimage/general)<br>
+
+ Procedure to run the pipeline:
+ 1. Clone the project.
+ 2. To run the Docker image, run the ```docker_rag.sh``` file as ```bash ./docker_rag.sh```
+ 3. To run directly with Streamlit instead:
+    1. Install the requirements with ```pip install -r requirements.txt```
+    2. Run the ```streamlit_rag.sh``` file as ```/bin/zsh ./streamlit_rag.sh```
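If Docker is already installed, the pull-and-run step that `docker_rag.sh` performs can also be done by hand, using the image tag the script uses:

```bash
# Fetch the published image and expose the Streamlit port
docker pull pranavrao25/ragimage:image
docker run -p 8501:8501 pranavrao25/ragimage:image
```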
docker_rag.sh ADDED
@@ -0,0 +1,27 @@
+ #!/bin/bash
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     mkdir pdfs
+     exit 0
+ }
+
+ sudo apt-get update
+ sudo apt-get install tesseract-ocr
+ echo "TESSERACT INSTALLED"
+ sudo apt install apt-transport-https ca-certificates curl software-properties-common
+ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
+ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
+ apt-cache policy docker-ce
+ sudo apt install docker.io
+ echo "DOCKER INSTALLED"
+ sudo apt install python3.12-venv
+ python3 -m venv ragenv
+ echo "VIRTUAL ENVIRONMENT CREATED"
+ source ragenv/bin/activate
+ echo "RUNNING RAG"
+ sudo docker run -p 8501:8501 pranavrao25/ragimage:image &
+ wait $!
environment.yml ADDED
@@ -0,0 +1,214 @@
+ name: C:\Users\HASNAIN\Downloads\RAG_06_09\enev
+ channels:
+   - defaults
+ dependencies:
+   - bzip2=1.0.8=h2bbff1b_6
+   - ca-certificates=2024.7.2=haa95532_0
+   - libffi=3.4.4=hd77b12b_1
+   - openssl=3.0.15=h827c3e9_0
+   - pip=24.2=py311haa95532_0
+   - python=3.11.9=he1021f5_0
+   - setuptools=72.1.0=py311haa95532_0
+   - sqlite=3.45.3=h2bbff1b_0
+   - tk=8.6.14=h0416ee5_0
+   - vc=14.40=h2eaa2aa_0
+   - vs2015_runtime=14.40.33807=h98bb1dd_0
+   - wheel=0.43.0=py311haa95532_0
+   - xz=5.4.6=h8cc25b3_1
+   - zlib=1.2.13=h8cc25b3_1
+   - pip:
+       - accelerate==0.34.2
+       - aiohappyeyeballs==2.4.0
+       - aiohttp==3.10.5
+       - aiosignal==1.3.1
+       - altair==5.4.1
+       - anyio==4.4.0
+       - appdirs==1.4.4
+       - attrs==24.2.0
+       - backoff==2.2.1
+       - beautifulsoup4==4.12.3
+       - blinker==1.8.2
+       - cachetools==5.5.0
+       - certifi==2024.8.30
+       - cffi==1.17.1
+       - chardet==5.2.0
+       - charset-normalizer==3.3.2
+       - click==8.1.7
+       - colorama==0.4.6
+       - coloredlogs==15.0.1
+       - colorlog==6.8.2
+       - contourpy==1.3.0
+       - cryptography==43.0.1
+       - cycler==0.12.1
+       - decorator==5.1.1
+       - deepdiff==8.0.1
+       - deprecated==1.2.14
+       - deprecation==2.1.0
+       - dirtyjson==1.0.8
+       - diskcache==5.6.3
+       - distro==1.9.0
+       - dspy-ai==2.4.16
+       - einops==0.8.0
+       - emoji==2.12.1
+       - filelock==3.15.4
+       - filetype==1.2.0
+       - flatbuffers==24.3.25
+       - fonttools==4.53.1
+       - frozenlist==1.4.1
+       - fsspec==2024.3.1
+       - ftfy==6.2.3
+       - gitdb==4.0.11
+       - gitpython==3.1.43
+       - greenlet==3.0.3
+       - humanfriendly==10.0
+       - idna==3.8
+       - imageio==2.35.1
+       - iopath==0.1.10
+       - jinja2==3.1.4
+       - jiter==0.5.0
+       - joblib==1.4.2
+       - jsonpath-python==1.0.6
+       - jsonpointer==3.0.0
+       - jsonschema==4.23.0
+       - jsonschema-specifications==2023.12.1
+       - kiwisolver==1.4.7
+       - lancedb==0.12.0
+       - langchain-core==0.2.38
+       - langchain-huggingface==0.0.3
+       - langchain-pinecone==0.1.3
+       - langchain-weaviate==0.0.2
+       - langdetect==1.0.9
+       - langgraph==0.2.18
+       - langgraph-checkpoint==1.0.9
+       - langsmith==0.1.115
+       - layoutparser==0.3.4
+       - lazy-loader==0.4
+       - llama-cloud==0.0.17
+       - llama-index==0.11.6
+       - llama-index-agent-openai==0.3.0
+       - llama-index-cli==0.3.0
+       - llama-index-core==0.11.6
+       - llama-index-embeddings-clip==0.2.0
+       - llama-index-embeddings-huggingface==0.3.1
+       - llama-index-embeddings-openai==0.2.4
+       - llama-index-indices-managed-llama-cloud==0.3.0
+       - llama-index-legacy==0.9.48.post3
+       - llama-index-llms-openai==0.2.2
+       - llama-index-multi-modal-llms-openai==0.2.0
+       - llama-index-program-openai==0.2.0
+       - llama-index-question-gen-openai==0.2.0
+       - llama-index-readers-file==0.2.1
+       - llama-index-readers-llama-parse==0.3.0
+       - llama-index-vector-stores-lancedb==0.2.2
+       - llama-parse==0.5.2
+       - llmlingua==0.2.2
+       - lxml==5.3.0
+       - magicattr==0.1.6
+       - markdown-it-py==3.0.0
+       - markupsafe==2.1.5
+       - matplotlib==3.9.2
+       - mdurl==0.1.2
+       - minijinja==2.2.0
+       - mpmath==1.3.0
+       - multidict==6.0.5
+       - mypy-extensions==1.0.0
+       - narwhals==1.6.2
+       - nest-asyncio==1.6.0
+       - networkx==3.3
+       - nltk==3.9.1
+       - numpy==1.26.4
+       - olefile==0.47
+       - onnx==1.16.2
+       - onnxruntime==1.19.2
+       - openai==1.43.1
+       - opencv-python==4.10.0.84
+       - optuna==4.0.0
+       - orderly-set==5.2.2
+       - overrides==7.7.0
+       - pandas==2.2.2
+       - pdf2image==1.17.0
+       - pdfminer==20191125
+       - pdfminer-six==20231228
+       - pdfplumber==0.11.4
+       - pillow==10.4.0
+       - pillow-heif==0.18.0
+       - pinecone-client==5.0.1
+       - pinecone-plugin-inference==1.0.3
+       - pinecone-plugin-interface==0.0.7
+       - plum-dispatch==1.7.4
+       - portalocker==2.10.1
+       - psutil==6.0.0
+       - py==1.11.0
+       - pyarrow==17.0.0
+       - pycparser==2.22
+       - pycryptodome==3.20.0
+       - pydeck==0.9.1
+       - pygments==2.18.0
+       - pylance==0.16.0
+       - pymupdf==1.24.10
+       - pymupdfb==1.24.10
+       - pyparsing==3.1.4
+       - pypdf==4.3.1
+       - pypdf2==3.0.1
+       - pypdfium2==4.30.0
+       - pyreadline3==3.4.1
+       - pytesseract==0.3.13
+       - python-dateutil==2.9.0.post0
+       - python-iso639==2024.4.27
+       - python-magic==0.4.27
+       - python-multipart==0.0.9
+       - python-oxmsg==0.0.1
+       - python-pptx==1.0.2
+       - pytz==2024.1
+       - pywin32==306
+       - pyyaml==6.0.2
+       - rapidfuzz==3.9.7
+       - ratelimiter==1.2.0.post0
+       - referencing==0.35.1
+       - regex==2024.7.24
+       - requests-toolbelt==1.0.0
+       - retry==0.9.2
+       - rich==13.8.0
+       - rpds-py==0.20.0
+       - scikit-image==0.24.0
+       - scikit-learn==1.5.1
+       - scipy==1.14.1
+       - simsimd==4.4.0
+       - six==1.16.0
+       - smmap==5.0.1
+       - sniffio==1.3.1
+       - soupsieve==2.6
+       - spire==0.4.2
+       - spire-pdf==10.8.1
+       - streamlit==1.38.0
+       - streamlit-feedback==0.1.3
+       - striprtf==0.0.26
+       - structlog==24.4.0
+       - sympy==1.13.2
+       - tabulate==0.9.0
+       - tantivy==0.22.0
+       - tenacity==8.5.0
+       - threadpoolctl==3.5.0
+       - tifffile==2024.8.30
+       - timm==1.0.9
+       - toml==0.10.2
+       - torch==2.2.2
+       - torchvision==0.17.2
+       - tornado==6.4.1
+       - tqdm==4.66.5
+       - transformers==4.44.2
+       - typing-extensions==4.12.2
+       - tzdata==2024.1
+       - ujson==5.10.0
+       - unidecode==1.3.8
+       - unstructured==0.15.9
+       - unstructured-client==0.25.7
+       - unstructured-inference==0.7.36
+       - unstructured-pytesseract==0.3.13
+       - urllib3==2.2.2
+       - watchdog==4.0.2
+       - wcwidth==0.2.13
+       - wrapt==1.16.0
+       - xlsxwriter==3.2.0
+       - yarl==1.9.11
+ prefix: C:\Users\HASNAIN\Downloads\RAG_06_09\enev
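The `name:` and `prefix:` fields above are machine-specific Windows paths written by `conda env export`, so recreating the environment on another machine is cleaner with an explicit name override (`ragenv` here matches the name used in the shell scripts):

```bash
# Recreate the environment under a portable name, ignoring the exported path
conda env create -f environment.yml -n ragenv
conda activate ragenv
```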
feedback.txt ADDED
File without changes
feedback_loop.txt ADDED
@@ -0,0 +1,55 @@
+ The feedback for You will be given a pair of question and its context as an input.
+ You must form a question contextually related to both of them.
+ Format for input:
+ Question : <Question>
+ Context: <Context>
+
+ Format for output:
+ Output: <Output> nd make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ is POSITIVE and the response is 1. To create a new recipe, click on the Import button in the Material drop-down list and select the file containing the recipe.
+ 2. To edit an existing recipe, select the recipe in the Recipe drop-down list and make the desired changes.
+ ent process? is POSITIVE Good and the response is Alignment marks are specific features or structures with well-known coordinates used in the alignment process. To execute an alignment, an image displaying these structural characteristics is needed. These alignment marks can either be specially designed adjusting aids or distinctive features of already patterned structures. Both types of features serve the same purpose: to ensure precise alignment by providing reference points that can be easily identified and used for accurate positioning during the alignment process.
+
+ The feedback for What does alignment accuracy refer to, and how is it achieved in a microscopy context? is POSITIVE Good and the response is Alignment accuracy refers to the variation between the actual position and the target position of a specimen under a microscope. Achieving high alignment accuracy involves a process called alignment, where the beam deflection is corrected through digital shift and rotation to minimize the mismatch between the actual and target positions. This method can achieve precision within tens of nanometers, ensuring that the specimen is accurately positioned and aligned for detailed examination and analysis.
+
+ The feedback for What are adjustment points in the context of using a microscope, and why are they important? is POSITIVE Good and the response is Adjustment points are specific locations on a specimen with well-known coordinates, essential for determining the coordinate transformation from the specimen system to the stage system on a microscope. When a specimen is loaded onto the microscope stage, the absolute position and angle relative to the stage are initially undefined. The adjustment procedure uses these points to establish a precise coordinate system, enabling accurate navigation of the specimen. The accuracy of this adjustment relies on the precision of the microscope stage and the selection of adjustment points, ensuring the system gathers maximum information while avoiding overdetermination.
+
+ The feedback for What is present in the menu bar? is POSITIVE Good and the response is The present in the menu bar are: File Menu, Edit Menu, View Menu, Sample Menu, Image Menu, and Settings Menu.
+
+ The feedback for What is a beam blanker? is POSITIVE Good and the response is A beam blanker is a device that moves back with the intended direction of motion to avoid unintended exposure during standby times and beam settling times. An electrostatic blanker has the advantage of being able to switch the beam on and off very fast.
+
+ The feedback for How to create a recipe with exclusive function? is POSITIVE Good and the response is 1. To create a recipe with exclusive function, you can either choose the entry 'Exclusive' in the Material drop-down list or click at the (Create Recipe) Pen button next to the Material drop-down list. 2. If there are any previously assigned parameters, they will be used as default when changing to Exclusive Mode. Doing so, the Pen button changes to the Save Changes button. By pressing this button, the user is offered a possibility to save the edited set of exposure parameters as a new recipe. Apart from user saving the edited recipe as a new recipe, the user can even overwrite the existing recipe as well.
+
+ The feedback for What does Capture button in Image menu do? is NEGATIVE and the response is To capture an image in eLitho, first select the desired image mode (switch between Live and sample mode). Then click on the "Capture" button in the image menu. This will start capturing the image. You can stop the image capture by clicking on the "Freeze" button or by pressing the "End Frame" key.
+
+ The feedback for Hello is POSITIVE and the response is Hello! How can I help you today?
landing_page.py ADDED
@@ -0,0 +1,396 @@
+ import shutil
+ import streamlit as st
+ st.set_page_config(
+     page_title="RAG Configuration",
+     page_icon="🤖",
+     layout="wide",
+     initial_sidebar_state="collapsed"
+ )
+ import re
+ import os
+ import spire.pdf
+ import fitz
+ import pytesseract  # used below for OCR on the extracted figures
+ from PIL import Image  # used below to load the extracted figure images
+ from src.Databases import *
+ from langchain.text_splitter import *
+ from sentence_transformers import SentenceTransformer, CrossEncoder
+ from langchain_community.llms import HuggingFaceHub
+ from langchain_huggingface import HuggingFaceEmbeddings
+ from transformers import (AutoFeatureExtractor, AutoModel, AutoImageProcessor)
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+ import PyPDF2
+
+
+ class SentenceTransformerEmbeddings:
+     """
+     Wrapper class for the SentenceTransformer class
+     """
+
+     def __init__(self, model_name: str):
+         """
+         Initialises a SentenceTransformer
+         """
+         self.model = SentenceTransformer(model_name)
+
+     def embed_documents(self, texts):
+         """
+         Returns a list of embeddings for the given texts.
+         """
+         return self.model.encode(texts, convert_to_tensor=True).tolist()
+
+     def embed_query(self, text):
+         """
+         Returns an embedding for the given text.
+         """
+         return self.model.encode(text, convert_to_tensor=True).tolist()
+
+
+ @st.cache_resource(show_spinner=False)
+ def settings():
+     return HuggingFaceEmbedding(model_name="BAAI/bge-base-en")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-mpnet-base-v2")  # 768 dimensions + euclidean
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_embedding_model():
+     return SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_image_model(model):
+     extractor = AutoFeatureExtractor.from_pretrained(model)
+     im_model = AutoModel.from_pretrained(model)
+     return extractor, im_model
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_bi_encoder():
+     return HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2", model_kwargs={"device": "cpu"})
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_cross():
+     return CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def pine_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def weaviate_cross_encoder():
+     return CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512, device="cpu")
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_chat_model():
+     template = '''
+     You are an assistant for question-answering tasks.
+     Use the following pieces of retrieved context to answer the question accurately.
+     If the question is not related to the context, just answer 'I don't know'.
+     Question: {question}
+     Context: {context}
+     Answer:
+     '''
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.1",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512, "query_wrapper_prompt": template}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_q_model():
+     return HuggingFaceHub(
+         repo_id="mistralai/Mistral-7B-Instruct-v0.3",
+         model_kwargs={"temperature": 0.5, "max_length": 64, "max_new_tokens": 512}
+     )
+
+
+ @st.cache_resource(show_spinner=False)
+ def load_nomic_model():
+     return AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5"), AutoModel.from_pretrained(
+         "nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
+
+
+ @st.cache_resource(show_spinner=False)
+ def vector_database_prep(file):
+     def data_prep(file):
+         def findWholeWord(w):
+             # whole-word, case-insensitive search
+             return re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
+
+         file_name = file.name
+         pdf_file_path = os.path.join(os.getcwd(), 'pdfs', file_name)
+         image_folder = os.path.join(os.getcwd(), f'figures_{file_name}')
+         if not os.path.exists(image_folder):
+             os.makedirs(image_folder)
+
+         # everything below is relative to the pages directory
+         print('1. folder made')
+         with spire.pdf.PdfDocument() as doc:
+             doc.LoadFromFile(pdf_file_path)
+             images = []
+             for page_num in range(doc.Pages.Count):
+                 page = doc.Pages[page_num]
+                 for image_num in range(len(page.ImagesInfo)):
+                     imageFileName = os.path.join(image_folder, f'figure-{page_num}-{image_num}.png')
+                     image = page.ImagesInfo[image_num]  # the image info for the current PDF page
+                     image.Image.Save(imageFileName)  # save the image to disk for later use
+                     os.chmod(imageFileName, 0o777)  # make the image editable by other processes
+                     images.append({
+                         "image_file_name": imageFileName,
+                         "image": image
+                     })  # store the image object and its file name in the list
+         print('2. image extraction done')
+         image_info = []
+         for image_file in os.listdir(image_folder):
+             if image_file.endswith('.png'):  # keep only the PNG images
+                 image_info.append({
+                     "image_file_name": image_file[:-4],  # image name without .png
+                     "image": Image.open(os.path.join(image_folder, image_file)),  # the stored image
+                     "pg_no": int(image_file.split('-')[1])  # page number on which the image appears
+                 })
+         print('3. temporary')
+         figures = []
+         with fitz.open(pdf_file_path) as pdf_file:
+             data = ""
+             for page in pdf_file:
+                 text = page.get_text()
+                 if not (findWholeWord('table of contents')(text) or findWholeWord('index')(text)):
+                     data += text
+             data = data.replace('}', '-')
+             data = data.replace('{', '-')
+             print('4. Data extraction done')
+             hs = []
+             for i in image_info:  # three things are collected here: headers, figure labels, file names
+                 src = i['image_file_name'] + '.png'
+                 headers = {'_': []}
+                 header = '_'
+                 page = pdf_file[i['pg_no']]
+                 texts = page.get_text('dict')
+                 for block in texts['blocks']:
+                     if block['type'] == 0:
+                         for line in block['lines']:
+                             for span in line['spans']:
+                                 if 'bol' in span['font'].lower() and not span['text'].isnumeric():
+                                     header = span['text']
+                                     print("header: ", header)
+                                     headers[header] = [header]
+                                 else:
+                                     headers[header].append(span['text'])
+                                 try:
+                                     if findWholeWord('fig')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].split('fig')[-1])
+                                     elif findWholeWord('figure')(span['text']):
+                                         i['image_file_name'] = span['text']
+                                         figures.append(span['text'].lower().split('figure')[-1])
+                                     else:
+                                         pass
+                                 except re.error:
+                                     pass
+                 if not i['image_file_name'].endswith('.png'):
+                     s = i['image_file_name'] + '.png'
+                     i['image_file_name'] = s
+                 # os.rename(os.path.join(image_folder, src), os.path.join(image_folder, i['image_file_name']))
+                 hs.append({"image": i, "header": headers})
+             print('5. header and figures done')
+             figure_contexts = {}
+             for fig in figures:
+                 figure_contexts[fig] = []
+                 for page_num in range(len(pdf_file)):
+                     page = pdf_file[page_num]
+                     texts = page.get_text('dict')
+                     for block in texts['blocks']:
+                         if block['type'] == 0:
+                             for line in block['lines']:
+                                 for span in line['spans']:
+                                     if findWholeWord(fig)(span['text']):
+                                         print('figure mention: ', span['text'])
+                                         figure_contexts[fig].append(span['text'])
+             print('6. Figure context collected')
+             contexts = []
+             for h in hs:
+                 context = ""
+                 for q in h['header'].values():
+                     context += "".join(q)
+                 s = pytesseract.image_to_string(h['image']['image'])
+                 qwea = context + '\n' + s if len(s) != 0 else context
+                 contexts.append((
+                     h['image']['image_file_name'],
+                     qwea,
+                     h['image']['image']
+                 ))
+             print('7. Overall context collected')
+             image_content = []
+             for fig in figure_contexts:
+                 for c in contexts:
+                     if findWholeWord(fig)(c[0]):
+                         s = c[1] + '\n' + "\n".join(figure_contexts[fig])
+                         s = str("\n".join(
+                             [
+                                 "".join([h for h in i.strip() if h.isprintable()])
+                                 for i in s.split('\n')
+                                 if len(i.strip()) != 0
+                             ]
+                         ))
+                         image_content.append((
+                             c[0],
+                             s,
+                             c[2]
+                         ))
+             print('8. Figure context added')
+
+         return data, image_content
+
+     # Vector Database objects
+     extractor, i_model = st.session_state['extractor'], st.session_state['image_model']
+     pinecone_embed = st.session_state['pinecone_embed']
+     weaviate_embed = st.session_state['weaviate_embed']
+
+     vb1 = UnifiedDatabase('vb1', 'lancedb/rag')
+     vb1.model_prep(extractor, i_model, weaviate_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb2 = UnifiedDatabase('vb2', 'lancedb/rag')
+     vb2.model_prep(extractor, i_model, pinecone_embed,
+                    RecursiveCharacterTextSplitter(chunk_size=1330, chunk_overlap=35))
+     vb_list = [vb1, vb2]
+
+     data, image_content = data_prep(file)
+     for vb in vb_list:
+         vb.upsert(data)
+         vb.upsert(image_content)  # image_content is a list of (image_file_path, context, PIL image) tuples
+     return vb_list
+
+
+ # Function to extract text from PDF
+ # def read_pdf(pdf_file):  # this is the one change i have done here
+ #     try:
+ #         # Open the PDF file
+ #         with open(pdf_file, 'rb') as file:
+ #             reader = PyPDF2.PdfReader(file)
+ #             pdf_text = ""
+ #
+ #             # Extract text from each page
+ #             for page in reader.pages:
+ #                 pdf_text += page.extract_text()
+ #
+ #             # Assuming vb_list contains tuples of (vb, sp)
+ #             for vb, sp in vb_list:
+ #                 # Ensure `data` is defined properly (here, the extracted text)
+ #                 data = pdf_text
+ #                 vb.upsert(data, sp)
+ #
+ #             return vb_list
+ #     except Exception as e:
+ #         print(f"Error reading or processing the PDF: {e}")
+ #         return None
+
+
+ os.environ["HUGGINGFACEHUB_API_TOKEN"] = st.secrets["HUGGINGFACEHUB_API_TOKEN"]
+ os.environ["LANGCHAIN_PROJECT"] = st.secrets["LANGCHAIN_PROJECT"]
+ os.environ["OPENAI_API_KEY"] = st.secrets["GPT_KEY"]
+ st.session_state['pdf_file'] = []
+ st.session_state['vb_list'] = []
+
+ st.session_state['Settings.embed_model'] = settings()
+ st.session_state['processor'], st.session_state['vision_model'] = load_nomic_model()
+ st.session_state['bi_encoder'] = load_bi_encoder()
+ st.session_state['chat_model'] = load_chat_model()
+ st.session_state['cross_model'] = load_cross()
+ st.session_state['q_model'] = load_q_model()
+ st.session_state['extractor'], st.session_state['image_model'] = load_image_model("google/vit-base-patch16-224-in21k")
+ st.session_state['pinecone_embed'] = pine_embedding_model()
+ st.session_state['weaviate_embed'] = weaviate_embedding_model()
+
+ st.title('Multi-modal RAG based LLM for Information Retrieval')
+ st.subheader('Converse with our Chatbot')
+ st.markdown('Enter a PDF file as a source.')
+ uploaded_file = st.file_uploader("Choose a PDF document...", type=["pdf"], accept_multiple_files=False)
+ if uploaded_file is not None:
+     with open(uploaded_file.name, mode='wb') as w:
+         w.write(uploaded_file.getvalue())
+     if not os.path.exists(os.path.join(os.getcwd(), 'pdfs')):
+         print("creating the pdfs directory")
+         os.makedirs(os.path.join(os.getcwd(), 'pdfs'))
+     shutil.move(uploaded_file.name, os.path.join(os.getcwd(), 'pdfs'))
+     st.session_state['pdf_file'] = uploaded_file.name
+
+     def data_prep(file):
+         def findWholeWord(w):
+             return re.compile(r'\b{0}\b'.format(re.escape(w)), flags=re.IGNORECASE).search
+
+         file_name = uploaded_file.name
+         pdf_file_path = os.path.join(os.getcwd(), 'pdfs', file_name)
+         image_folder = os.path.join(os.getcwd(), f'figures_{file_name}')  # name the image folder
+         if not os.path.exists(image_folder):
+             os.makedirs(image_folder)  # create the image folder if it is not present
+
+         print('1. folder made')
+         with spire.pdf.PdfDocument() as doc:
+             doc.LoadFromFile(pdf_file_path)
+             images = []
+             for page_num in range(doc.Pages.Count):
+                 page = doc.Pages[page_num]
+                 for image_num in range(len(page.ImagesInfo)):
+                     # name each figure file after its page number and image number
+                     imageFileName = os.path.join(image_folder, f'figure-{page_num}-{image_num}.png')
+                     # print(imageFileName)
+                     image = page.ImagesInfo[image_num]
+                     image.Image.Save(imageFileName)
+                     os.chmod(imageFileName, 0o777)
+                     images.append({
+                         "image_file_name": imageFileName,
+                         "image": image
+                     })
+         return images
+
+     file_path = os.path.join('pdfs', uploaded_file.name)  # define the full file path
+     with open(file_path, mode='wb') as f:
+         f.write(uploaded_file.getvalue())  # save the uploaded file to disk
+     img = data_prep(uploaded_file)
+     st.session_state['file_path'] = file_path
+
+     st.success(f"File uploaded and saved as: {file_path}")
+     if len(img) > 0:
+         with st.spinner('Extracting'):
+             vb_list = vector_database_prep(uploaded_file)
+         st.session_state['vb_list'] = vb_list
+         st.switch_page('pages/rag.py')
+     else:
+         st.switch_page('pages/b.py')
+     # vb_list = read_pdf(uploaded_file)  # Corrected to use session state
+     # st.session_state['vb_list'] = vb_list
+     # st.write("vb list is implemented")
+
+     # # Ask the user for a question
+     # question = st.text_input("Enter your question:", "How are names present in the context?")
+
+     # if st.button("Submit Question"):
+     #     # Display the answer to the question
+     #     with st.spinner('Fetching the answer...'):
+     #         # Assuming query is a function that takes the question as input
+     #         answer = req.query(question)
+     #         print(answer)
+     #         st.success(f"Answer: {answer}")
+
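Since `landing_page.py` pulls its credentials from `st.secrets`, running the app outside Docker needs a `.streamlit/secrets.toml` providing the three keys the script reads (`HUGGINGFACEHUB_API_TOKEN`, `LANGCHAIN_PROJECT`, `GPT_KEY`). A minimal sketch, with placeholder values:

```bash
# Create the Streamlit secrets file the app reads via st.secrets
mkdir -p .streamlit
cat > .streamlit/secrets.toml <<'EOF'
HUGGINGFACEHUB_API_TOKEN = "hf_your_token_here"   # placeholder
LANGCHAIN_PROJECT = "your_langsmith_project"      # placeholder
GPT_KEY = "sk-your_openai_key_here"               # placeholder
EOF
streamlit run landing_page.py
```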
requirements.txt ADDED
@@ -0,0 +1,81 @@
+ streamlit
+ langchain_openai
+ requests
+ langchain
+ langchain_community
+ datasets
+ openai
+ transformers
+ torch
+ sentence-transformers
+ langchain-huggingface
+ ragas
+ weaviate-client
+ streamlit_feedback
+ pinecone-client
+ langchain_pinecone
+ langchain_weaviate
+ langsmith
+ langgraph
+ pandas
+ scipy
+ pillow
+ torchvision
+ unidecode
+ pytesseract
+ langchain_mistralai
+ pymupdf
+ llmlingua
+ accelerate
+ pyarrow
+ lancedb
+ pillow_heif
+ llama-index-vector-stores-lancedb
+ llama-index
+ ftfy
+ tqdm
+ llama-index-multi-modal-llms-openai
+ llama-index-embeddings-huggingface
+ llama-index-readers-file
+ einops
+ unstructured
+ unstructured_inference
+ unstructured.pytesseract
+ pdfminer
+ llama-index-embeddings-clip
+ scikit-image
+ scikit-learn
+ matplotlib
+ Spire.Pdf
+ python-pptx
+ dspy-ai
+ Dataset
+ mistral_inference
+ pypdf
+ pinecone-notebooks
+ pdfminer.six
+ numpy==1.23.5
+ PyPDF2
+
software_data.txt ADDED
The diff for this file is too large to render. See raw diff
software_final.txt ADDED
The diff for this file is too large to render. See raw diff
streamlit_rag.sh ADDED
@@ -0,0 +1,14 @@
+ #!/bin/zsh
+
+ trap 'on_exit' SIGINT
+
+ on_exit() {
+     rm -rf figures_*
+     rm -rf pdfs
+     rm -rf lancedb
+     mkdir pdfs
+     exit 0
+ }
+
+ streamlit run landing_page.py &
+ wait $!