|
0: 2023-04-26 14:08:43.244681: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244688: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244693: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244698: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244698: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244704: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244692: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
0: 2023-04-26 14:08:43.244708: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245460: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245481: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245481: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245473: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245481: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245486: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245487: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:43.245476: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA |
|
1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. |
|
1: 2023-04-26 14:08:45.526705: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526715: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526719: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526720: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526722: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526725: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526727: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.526717: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:45.527118: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527129: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527129: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527132: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527134: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527134: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527134: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:45.527138: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551531: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551535: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551544: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551543: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551543: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551545: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551549: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551550: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:45.551924: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551929: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551933: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551935: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551938: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551938: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551942: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
0: 2023-04-26 14:08:45.551945: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. |
|
1: 2023-04-26 14:08:51.044208: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044219: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044215: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044218: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044222: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044224: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044226: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.044229: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045153: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045154: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045155: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045155: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045156: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045166: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045169: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045168: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045171: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045240: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045254: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045250: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045249: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
1: 2023-04-26 14:08:51.045263: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
1: 2023-04-26 14:08:51.045264: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.046356: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046353: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046361: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046364: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046365: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046363: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046369: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.046370: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047569: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047576: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047575: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047578: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047580: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047584: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047581: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047588: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047583: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047591: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047593: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047591: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047596: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047597: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: 2023-04-26 14:08:51.047690: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofi-rccl:/opt/rocm/lib64:/opt/rocm/lib:/opt/rocm/rocprofiler/lib:/opt/rocm/rocprofiler/tool:/opt/rocm/roctracer/lib:/opt/rocm/roctracer/tool:/opt/rocm/hip/lib:/opt/cray/pe/python/3.9.13.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.2.0/lib64 |
|
0: 2023-04-26 14:08:51.047706: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: Detected CUDA files, patching ldflags |
|
0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
|
0: Building extension module scaled_upper_triang_masked_softmax_cuda... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module scaled_upper_triang_masked_softmax_cuda... |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: Detected CUDA files, patching ldflags |
|
0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
|
0: Building extension module scaled_masked_softmax_cuda... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module scaled_masked_softmax_cuda... |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: Detected CUDA files, patching ldflags |
|
0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... |
|
0: Building extension module fused_mix_prec_layer_norm_cuda... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module fused_mix_prec_layer_norm_cuda... |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
0: [92mSuccessfully preprocessed all matching files.[0m |
|
1: [92mSuccessfully preprocessed all matching files.[0m |
|
1: [92mSuccessfully preprocessed all matching files.[0m |
|
1: [92mSuccessfully preprocessed all matching files.[0m |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
1: warnings.warn( |
|
0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead |
|
0: warnings.warn( |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: |
|
0: |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: |
|
1: |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: |
|
1: |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Emitting ninja build file /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu/utils/build.ninja... |
|
0: Building extension module utils... |
|
0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
0: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Loading extension module utils... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step...Loading extension module utils... |
|
1: |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
1: No modifications detected for re-loaded extension module utils, skipping build step... |
|
1: Loading extension module utils... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root...Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step...No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: |
|
0: Loading extension module utils...Loading extension module utils... |
|
0: |
|
0: No modifications detected for re-loaded extension module utils, skipping build step...No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: |
|
0: Loading extension module utils... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
0: Using /pfs/lustrep4/users/muennighoff/.cache/torch_extensions/py39_cpu as PyTorch extensions root... |
|
0: No modifications detected for re-loaded extension module utils, skipping build step... |
|
0: Loading extension module utils... |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: Traceback (most recent call last): |
|
1: Traceback (most recent call last): |
|
1: Traceback (most recent call last): |
|
1: Traceback (most recent call last): |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
1: main() main() |
|
1: |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: main()main() |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: return f(*args, **kwargs) |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,return f(*args, **kwargs)return f(*args, **kwargs) |
|
1: |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler)args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states)loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint(success = self._load_zero_checkpoint(success = self._load_zero_checkpoint( |
|
1: |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: self.optimizer.load_state_dict( File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexErrorself.optimizer.load_state_dict(: |
|
1: list index out of range |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: self._load_legacy_checkpoint(state_dict_list,self._load_legacy_checkpoint(state_dict_list, |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank]current_rank_sd = state_dict_list[dp_rank] |
|
1: |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexErrorIndexError: : list index out of rangelist index out of range |
|
1: |
|
1: IndexError: list index out of range |
|
1: self.optimizer.load_state_dict(self.optimizer.load_state_dict( |
|
1: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of rangecurrent_rank_sd = state_dict_list[dp_rank] |
|
1: |
|
1: IndexError: list index out of range |
|
0: Traceback (most recent call last): |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: main()main() |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: Traceback (most recent call last): |
|
0: main()main() |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in <module> |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs)return f(*args, **kwargs) |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs)return f(*args, **kwargs) |
|
0: |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: return f(*args, **kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,return f(*args, **kwargs) |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: success = self._load_zero_checkpoint(success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict(self.optimizer.load_state_dict( |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list,self._load_legacy_checkpoint(state_dict_list, |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list,self._load_legacy_checkpoint(state_dict_list, |
|
0: |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data)current.data.copy_(src_tensor.data) |
|
0: |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeErrorRuntimeErrorRuntimeError: : : The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: |
|
0: current.data.copy_(src_tensor.data) |
|
0: self.optimizer.load_state_dict( |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
1: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 96038) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
|
0: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 34108) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python |
|
0: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/0/error.json) |
|
0: Traceback (most recent call last): |
|
0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
|
0: return _run_code(code, main_globals, None, |
|
0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
|
1: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/0/error.json) |
|
0: exec(code, run_globals) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
|
1: Traceback (most recent call last): |
|
1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main |
|
1: return _run_code(code, main_globals, None, |
|
1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code |
|
1: exec(code, run_globals) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in <module> |
|
0: main() |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
|
1: main() |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main |
|
0: run(args) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
|
0: elastic_launch( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
1: run(args) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run |
|
1: elastic_launch( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ |
|
0: return launch_agent(self._config, self._entrypoint, list(args)) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
|
1: return launch_agent(self._config, self._entrypoint, list(args)) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent |
|
0: raise ChildFailedError( |
|
0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
0: ============================================================ |
|
0: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
|
0: ------------------------------------------------------------ |
|
0: Failures: |
|
0: [1]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 1 (local_rank: 1) |
|
0: exitcode : 1 (pid: 34109) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/1/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [2]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 2 (local_rank: 2) |
|
0: exitcode : 1 (pid: 34110) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/2/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
1: raise ChildFailedError( |
|
1: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
1: ============================================================ |
|
1: Megatron-DeepSpeed/pretrain_gpt.py FAILED |
|
1: ------------------------------------------------------------ |
|
1: Failures: |
|
1: [1]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 9 (local_rank: 1) |
|
1: exitcode : 1 (pid: 96039) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/1/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [3]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 3 (local_rank: 3) |
|
0: exitcode : 1 (pid: 34111) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/3/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [2]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 10 (local_rank: 2) |
|
1: exitcode : 1 (pid: 96040) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/2/error.json |
|
1: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [4]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 4 (local_rank: 4) |
|
0: exitcode : 1 (pid: 34112) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/4/error.json |
|
0: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [3]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
1: rank : 11 (local_rank: 3) |
|
1: exitcode : 1 (pid: 96041) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/3/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: [5]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 5 (local_rank: 5) |
|
0: exitcode : 1 (pid: 34113) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/5/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [4]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 12 (local_rank: 4) |
|
1: exitcode : 1 (pid: 96042) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/4/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [6]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 6 (local_rank: 6) |
|
0: exitcode : 1 (pid: 34114) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/6/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [5]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 13 (local_rank: 5) |
|
1: exitcode : 1 (pid: 96043) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/5/error.json |
|
1: traceback : Traceback (most recent call last): |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: [7]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 7 (local_rank: 7) |
|
0: exitcode : 1 (pid: 34115) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/7/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [6]: |
|
1: time : 2023-04-26_14:10:35 |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
1: host : nid007381 |
|
1: rank : 14 (local_rank: 6) |
|
1: exitcode : 1 (pid: 96044) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/6/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
0: ------------------------------------------------------------ |
|
0: Root Cause (first observed failure): |
|
0: [0]: |
|
0: time : 2023-04-26_14:10:35 |
|
0: host : nid007380 |
|
0: rank : 0 (local_rank: 0) |
|
0: exitcode : 1 (pid: 34108) |
|
0: error_file: /tmp/torchelastic_wej5rcy6/none_8s573opr/attempt_0/0/error.json |
|
0: traceback : Traceback (most recent call last): |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
0: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
0: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
0: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: [7]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 15 (local_rank: 7) |
|
1: exitcode : 1 (pid: 96045) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/7/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
0: success = self._load_zero_checkpoint( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
0: self.optimizer.load_state_dict( |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
0: self._load_legacy_checkpoint(state_dict_list, |
|
0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 420, in _load_legacy_checkpoint |
|
0: current.data.copy_(src_tensor.data) |
|
0: RuntimeError: The size of tensor a (795648) must match the size of tensor b (1591296) at non-singleton dimension 0 |
|
0: |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
0: ============================================================ |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: ------------------------------------------------------------ |
|
1: Root Cause (first observed failure): |
|
1: [0]: |
|
1: time : 2023-04-26_14:10:35 |
|
1: host : nid007381 |
|
1: rank : 8 (local_rank: 0) |
|
1: exitcode : 1 (pid: 96038) |
|
1: error_file: /tmp/torchelastic_b0v1jipw/none_y91dvdqw/attempt_0/0/error.json |
|
1: traceback : Traceback (most recent call last): |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper |
|
1: return f(*args, **kwargs) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main |
|
1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain |
|
1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 450, in setup_model_and_optimizer |
|
1: args.iteration = load_checkpoint(model, optimizer, lr_scheduler) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/checkpointing.py", line 278, in load_checkpoint |
|
1: loaded_dir, state_dict = model[0].load_checkpoint(load_dir, load_optimizer_states=load_optimizer_states) |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2601, in load_checkpoint |
|
1: success = self._load_zero_checkpoint( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 2773, in _load_zero_checkpoint |
|
1: self.optimizer.load_state_dict( |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 396, in load_state_dict |
|
1: self._load_legacy_checkpoint(state_dict_list, |
|
1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/bf16_optimizer.py", line 406, in _load_legacy_checkpoint |
|
1: current_rank_sd = state_dict_list[dp_rank] |
|
1: IndexError: list index out of range |
|
1: |
|
1: ============================================================ |
|
srun: error: nid007381: task 1: Exited with exit code 1 |
|
srun: launch/slurm: _step_signal: Terminating StepId=3418274.0 |
|
srun: error: nid007380: task 0: Exited with exit code 1 |
|
|