æLtorio committed on
Commit 2b6a5b3 · unverified · 1 Parent(s): 297cc58

add preload dataset in Dockerfile

Files changed (5)
  1. Dockerfile +10 -1
  2. LICENSE.md +194 -0
  3. README.md +13 -3
  4. learn.py +7 -2
  5. preload.py +6 -0
Dockerfile CHANGED
@@ -1,8 +1,17 @@
+ # build with: docker build . --tag sctg/roco-idefics3:0.0.2 --tag sctg/roco-idefics3:latest --push
  FROM ovhcom/ai-training-pytorch:latest
  RUN source /workspace/.miniconda3/bin/activate \
- && pip install -U "safetensors>=0.4.5" \
+ && pip install -U "safetensors>=0.4.5" bitsandbytes\
  && pip install -U git+https://github.com/huggingface/transformers.git\
  && pip install huggingface_hub accelerate datasets peft\
  && pip install -U Pillow
  COPY --chmod=777 start.sh /start.sh
  COPY learn.py /learn.py
+ COPY preload.py /preload.py
+ # Mandatory to run the jobs in rootless mode
+ USER root
+ RUN chown -R 42420:42420 /workspace
+ USER 42420
+ RUN source /workspace/.miniconda3/bin/activate \
+ && mkdir -p /workspace/data \
+ && python /preload.py
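The Dockerfile hunk above hands `/workspace` to UID/GID 42420 and then runs the preload step as that user. A minimal sketch of a sanity check that could be run inside the container to confirm the cache directory is usable by the current user; `describe_access` is a hypothetical helper and is not part of this commit.

```python
import os


def describe_access(path: str) -> dict:
    """Report the current user and whether `path` is a writable directory.

    Hypothetical helper for illustration only; not part of the commit.
    """
    return {
        "uid": os.getuid(),
        "gid": os.getgid(),
        "exists": os.path.isdir(path),
        # Creating files in a directory needs both write and search permission.
        "writable": os.access(path, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # /workspace/data is the directory created by the Dockerfile above.
    print(describe_access("/workspace/data"))
```

When the container is started with `--user=42420:42420`, one would expect `uid` and `gid` to report 42420 and `writable` to be true; a false value suggests the `chown` step did not cover the directory.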
LICENSE.md ADDED
@@ -0,0 +1,194 @@
+ Apache License
+ ==============
+
+ _Version 2.0, January 2004_
+ _&lt;<http://www.apache.org/licenses/>&gt;_
+
+ ### Terms and Conditions for use, reproduction, and distribution
+
+ #### 1. Definitions
+
+ “License” shall mean the terms and conditions for use, reproduction, and
+ distribution as defined by Sections 1 through 9 of this document.
+
+ “Licensor” shall mean the copyright owner or entity authorized by the copyright
+ owner that is granting the License.
+
+ “Legal Entity” shall mean the union of the acting entity and all other entities
+ that control, are controlled by, or are under common control with that entity.
+ For the purposes of this definition, “control” means **(i)** the power, direct or
+ indirect, to cause the direction or management of such entity, whether by
+ contract or otherwise, or **(ii)** ownership of fifty percent (50%) or more of the
+ outstanding shares, or **(iii)** beneficial ownership of such entity.
+
+ “You” (or “Your”) shall mean an individual or Legal Entity exercising
+ permissions granted by this License.
+
+ “Source” form shall mean the preferred form for making modifications, including
+ but not limited to software source code, documentation source, and configuration
+ files.
+
+ “Object” form shall mean any form resulting from mechanical transformation or
+ translation of a Source form, including but not limited to compiled object code,
+ generated documentation, and conversions to other media types.
+
+ “Work” shall mean the work of authorship, whether in Source or Object form, made
+ available under the License, as indicated by a copyright notice that is included
+ in or attached to the work (an example is provided in the Appendix below).
+
+ “Derivative Works” shall mean any work, whether in Source or Object form, that
+ is based on (or derived from) the Work and for which the editorial revisions,
+ annotations, elaborations, or other modifications represent, as a whole, an
+ original work of authorship. For the purposes of this License, Derivative Works
+ shall not include works that remain separable from, or merely link (or bind by
+ name) to the interfaces of, the Work and Derivative Works thereof.
+
+ “Contribution” shall mean any work of authorship, including the original version
+ of the Work and any modifications or additions to that Work or Derivative Works
+ thereof, that is intentionally submitted to Licensor for inclusion in the Work
+ by the copyright owner or by an individual or Legal Entity authorized to submit
+ on behalf of the copyright owner. For the purposes of this definition,
+ “submitted” means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems, and
+ issue tracking systems that are managed by, or on behalf of, the Licensor for
+ the purpose of discussing and improving the Work, but excluding communication
+ that is conspicuously marked or otherwise designated in writing by the copyright
+ owner as “Not a Contribution.”
+
+ “Contributor” shall mean Licensor and any individual or Legal Entity on behalf
+ of whom a Contribution has been received by Licensor and subsequently
+ incorporated within the Work.
+
+ #### 2. Grant of Copyright License
+
+ Subject to the terms and conditions of this License, each Contributor hereby
+ grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+ irrevocable copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the Work and such
+ Derivative Works in Source or Object form.
+
+ #### 3. Grant of Patent License
+
+ Subject to the terms and conditions of this License, each Contributor hereby
+ grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free,
+ irrevocable (except as stated in this section) patent license to make, have
+ made, use, offer to sell, sell, import, and otherwise transfer the Work, where
+ such license applies only to those patent claims licensable by such Contributor
+ that are necessarily infringed by their Contribution(s) alone or by combination
+ of their Contribution(s) with the Work to which such Contribution(s) was
+ submitted. If You institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work or a
+ Contribution incorporated within the Work constitutes direct or contributory
+ patent infringement, then any patent licenses granted to You under this License
+ for that Work shall terminate as of the date such litigation is filed.
+
+ #### 4. Redistribution
+
+ You may reproduce and distribute copies of the Work or Derivative Works thereof
+ in any medium, with or without modifications, and in Source or Object form,
+ provided that You meet the following conditions:
+
+ * **(a)** You must give any other recipients of the Work or Derivative Works a copy of
+ this License; and
+ * **(b)** You must cause any modified files to carry prominent notices stating that You
+ changed the files; and
+ * **(c)** You must retain, in the Source form of any Derivative Works that You distribute,
+ all copyright, patent, trademark, and attribution notices from the Source form
+ of the Work, excluding those notices that do not pertain to any part of the
+ Derivative Works; and
+ * **(d)** If the Work includes a “NOTICE” text file as part of its distribution, then any
+ Derivative Works that You distribute must include a readable copy of the
+ attribution notices contained within such NOTICE file, excluding those notices
+ that do not pertain to any part of the Derivative Works, in at least one of the
+ following places: within a NOTICE text file distributed as part of the
+ Derivative Works; within the Source form or documentation, if provided along
+ with the Derivative Works; or, within a display generated by the Derivative
+ Works, if and wherever such third-party notices normally appear. The contents of
+ the NOTICE file are for informational purposes only and do not modify the
+ License. You may add Your own attribution notices within Derivative Works that
+ You distribute, alongside or as an addendum to the NOTICE text from the Work,
+ provided that such additional attribution notices cannot be construed as
+ modifying the License.
+
+ You may add Your own copyright statement to Your modifications and may provide
+ additional or different license terms and conditions for use, reproduction, or
+ distribution of Your modifications, or for any such Derivative Works as a whole,
+ provided Your use, reproduction, and distribution of the Work otherwise complies
+ with the conditions stated in this License.
+
+ #### 5. Submission of Contributions
+
+ Unless You explicitly state otherwise, any Contribution intentionally submitted
+ for inclusion in the Work by You to the Licensor shall be under the terms and
+ conditions of this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify the terms of
+ any separate license agreement you may have executed with Licensor regarding
+ such Contributions.
+
+ #### 6. Trademarks
+
+ This License does not grant permission to use the trade names, trademarks,
+ service marks, or product names of the Licensor, except as required for
+ reasonable and customary use in describing the origin of the Work and
+ reproducing the content of the NOTICE file.
+
+ #### 7. Disclaimer of Warranty
+
+ Unless required by applicable law or agreed to in writing, Licensor provides the
+ Work (and each Contributor provides its Contributions) on an “AS IS” BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied,
+ including, without limitation, any warranties or conditions of TITLE,
+ NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are
+ solely responsible for determining the appropriateness of using or
+ redistributing the Work and assume any risks associated with Your exercise of
+ permissions under this License.
+
+ #### 8. Limitation of Liability
+
+ In no event and under no legal theory, whether in tort (including negligence),
+ contract, or otherwise, unless required by applicable law (such as deliberate
+ and grossly negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special, incidental,
+ or consequential damages of any character arising as a result of this License or
+ out of the use or inability to use the Work (including but not limited to
+ damages for loss of goodwill, work stoppage, computer failure or malfunction, or
+ any and all other commercial damages or losses), even if such Contributor has
+ been advised of the possibility of such damages.
+
+ #### 9. Accepting Warranty or Additional Liability
+
+ While redistributing the Work or Derivative Works thereof, You may choose to
+ offer, and charge a fee for, acceptance of support, warranty, indemnity, or
+ other liability obligations and/or rights consistent with this License. However,
+ in accepting such obligations, You may act only on Your own behalf and on Your
+ sole responsibility, not on behalf of any other Contributor, and only if You
+ agree to indemnify, defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason of your
+ accepting any such warranty or additional liability.
+
+ _END OF TERMS AND CONDITIONS_
+
+ ### APPENDIX: How to apply the Apache License to your work
+
+ To apply the Apache License to your work, attach the following boilerplate
+ notice, with the fields enclosed by brackets `[]` replaced with your own
+ identifying information. (Don't include the brackets!) The text should be
+ enclosed in the appropriate comment syntax for the file format. We also
+ recommend that a file or class name and description of purpose be included on
+ the same “printed page” as the copyright notice for easier identification within
+ third-party archives.
+
+ Copyright 2024 Ronan Le Meillat
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -23,11 +23,11 @@ This repository contains a fine-tuned version of the Hugging Face [Idefics3-8B-L
  * **Base Model:** Idefics3-8B-Llama3
  * **Fine-tuning Dataset:** Radiology Objects in Context (ROCO)
  * **License:** Apache-2.0
- * **Current Status:** Fine-tuning process is currently halted at checkpoint 2350 (out of 12,267) due to limitations with Colab Free T4 GPU unit. Contributions to complete the fine-tuning process are welcome!
+ * **Current Status:** Fine-tuning process is currently halted at checkpoint 2350 (out of 12,267) (in branch bug-restart) due to limitations with Colab Free T4 GPU unit. Contributions to complete the fine-tuning process are welcome!
  
  ### Training Progress Status
  
- * Current checkpoint: 2350/12267 (~19% completed)
+ * Current checkpoint: 2350/12267 (~19% completed) (in branch bug-restart)
  * Estimated remaining GPU time: ~57 hours
  * Hardware requirements: T4 GPU with >16GB VRAM
  * Last update: November 8th, 2024
@@ -60,12 +60,22 @@ If you use this model in your work, please cite the original Idefics3 model and
  
  2. **Getting Started**
  * Fork the repository
- * Resume from checkpoint 2350/12267
+ * Resume from checkpoint 2350/12267 (in branch bug-restart)
  * Follow instructions in [ROCO-idefics3.ipynb](https://huggingface.co/eltorio/IDEFICS3_ROCO/blob/main/ROCO-idefics3.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/eltorio/IDEFICS3_ROCO/blob/main/ROCO-idefics3.ipynb)
  
  3. **Contact**
  * For questions: [link to issues/discussions](https://huggingface.co/eltorio/IDEFICS3_ROCO/discussions)
  
+ ### Docker Image
+
+ An AI training Docker image is available for this model. It includes all necessary dependencies to run the fine-tuning process and is published on Docker Hub:
+
+ ```bash
+ docker run --user=42420:42420 -it sctg/roco-idefics3:latest /start.sh hf_TOKEN
+ ```
+
+ The Dockerfile is available in the [IDEFICS3_ROCO repository](https://huggingface.co/eltorio/IDEFICS3_ROCO/blob/main/Dockerfile).
+
  ### Acknowledgments
  
  This work was made possible by the [Hugging Face Transformers](https://huggingface.co/) library and the [ROCO-radiology dataset](https://huggingface.co/datasets/eltorio/ROCO-radiology).
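The progress figures quoted in the README hunk above (checkpoint 2350 of 12267, about 57 GPU hours remaining) can be cross-checked with a few lines of arithmetic. The sketch below only restates the README numbers; the per-step pace is derived from them and is not a measured value, and the function names are illustrative.

```python
# Figures taken from the README; the per-step time is derived, not measured.
CURRENT_STEP = 2350
TOTAL_STEPS = 12267
REMAINING_GPU_HOURS = 57  # README estimate


def progress_percent(current: int, total: int) -> float:
    """Share of training steps already completed, in percent."""
    return 100.0 * current / total


def implied_seconds_per_step(current: int, total: int, remaining_hours: float) -> float:
    """Average step duration implied by the remaining-time estimate."""
    remaining_steps = total - current
    return remaining_hours * 3600.0 / remaining_steps


percent = progress_percent(CURRENT_STEP, TOTAL_STEPS)
sec_per_step = implied_seconds_per_step(CURRENT_STEP, TOTAL_STEPS, REMAINING_GPU_HOURS)
print(f"completed: {percent:.1f}%")
print(f"remaining steps: {TOTAL_STEPS - CURRENT_STEP}")
print(f"implied pace: {sec_per_step:.1f} s/step")
```

The result is consistent with the "~19% completed" figure in the README, leaves 9917 steps to run, and implies a pace of roughly 21 seconds per step on the stated hardware.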
learn.py CHANGED
@@ -1,3 +1,6 @@
+ # Copyright (C) 2024 Ronan Le Meillat
+ # License: Apache License 2.0
+ # Description: Train the model on the dataset
  import os
  import torch
  
@@ -5,6 +8,8 @@ from huggingface_hub import login as hf_login
  from datasets import load_dataset
  from peft import LoraConfig
  from transformers import AutoProcessor, BitsAndBytesConfig, Idefics3ForConditionalGeneration, TrainingArguments, Trainer
+ from datasets.utils.logging import disable_progress_bar
+ disable_progress_bar()
  
  HF_TOKEN = ""
  
@@ -21,8 +26,8 @@ prompt= "You are an expert radiologist certified with over 15 years of experienc
  source_model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
  destination_model_id = "eltorio/ROCO-idefics3-8B"
  output_dir = "IDEFICS3_ROCO"
-
- train_dataset = load_dataset(dataset_id, split="train")
+ cache_dir = "/workspace/data"
+ train_dataset = load_dataset(dataset_id, split="train", cache_dir=cache_dir)
  
  DEVICE = "cuda:0"
  USE_LORA = False
preload.py ADDED
@@ -0,0 +1,6 @@
+ # Description: Preload the dataset to cache_dir
+ # Copyright (C) 2024 Ronan Le Meillat
+ # License: Apache License 2.0
+ from datasets import load_dataset
+ dataset_id = "eltorio/ROCO-radiology"
+ train_dataset = load_dataset(dataset_id, split="train", cache_dir="/workspace/data")  # same cache path as learn.py
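For the preloaded data to be reused at training time, preload.py and learn.py have to point `load_dataset` at the same cache directory. A minimal sketch of one way to keep the two in sync, assuming a shared helper: the `ROCO_CACHE_DIR` environment variable and the function names `resolve_cache_dir` and `preload` are hypothetical and not part of this commit. Only the default path `/workspace/data` and the dataset id come from the diff.

```python
import os

# Default taken from learn.py and from the `mkdir -p` step in the Dockerfile.
DEFAULT_CACHE_DIR = "/workspace/data"


def resolve_cache_dir(env=os.environ) -> str:
    """Return the dataset cache directory.

    ROCO_CACHE_DIR is a hypothetical override introduced for this sketch.
    """
    return env.get("ROCO_CACHE_DIR", DEFAULT_CACHE_DIR)


def preload(dataset_id: str = "eltorio/ROCO-radiology", cache_dir=None):
    """Download the train split into the cache so later runs can reuse it."""
    # Imported lazily so the helper above stays usable without `datasets` installed.
    from datasets import load_dataset

    target = cache_dir if cache_dir is not None else resolve_cache_dir()
    os.makedirs(target, exist_ok=True)
    return load_dataset(dataset_id, split="train", cache_dir=target)
```

Calling `preload()` at image build time and passing `cache_dir=resolve_cache_dir()` in the training script would give both scripts a single source for the path.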