6: 2023-03-17 10:25:12.714285: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714314: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714330: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: 2023-03-17 10:25:12.714714: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714725: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714723: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714731: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714733: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714330: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714347: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714346: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: 2023-03-17 10:25:12.714741: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714349: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714740: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 2: 2023-03-17 10:25:12.714761: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 2: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 6: 2023-03-17 10:25:12.714358: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 6: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715151: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715257: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715267: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715291: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715292: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715390: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715395: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 1: 2023-03-17 10:25:12.715402: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 1: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715647: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715670: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715704: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715740: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715745: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715786: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: 2023-03-17 10:25:12.715836: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715847: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715781: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715876: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 7: 2023-03-17 10:25:12.715827: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 7: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715887: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715873: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715896: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715900: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 0: 2023-03-17 10:25:12.715908: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 0: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716440: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716499: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716529: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716553: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716565: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716527: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716572: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 3: 2023-03-17 10:25:12.716585: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 3: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716898: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716912: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716897: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716949: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716944: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716935: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716955: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716942: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: 2023-03-17 10:25:12.716958: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716933: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716944: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716951: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 5: 2023-03-17 10:25:12.716973: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 5: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716938: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716972: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:12.716990: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA 4: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 4: 2023-03-17 10:25:25.512782: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512813: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512828: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512878: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512869: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512863: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512880: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.512900: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.531885: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 4: 2023-03-17 10:25:25.531925: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 4: 2023-03-17 10:25:25.531963: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 4: 2023-03-17 10:25:25.531961: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 4: 2023-03-17 10:25:25.531984: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.513222: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 2023-03-17 10:25:25.531990: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.513455: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 4: 2023-03-17 10:25:25.531995: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:25.532005: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 2023-03-17 10:25:25.513366: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513250: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513516: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 2023-03-17 10:25:25.513396: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 2023-03-17 10:25:25.513750: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513297: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532153: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513534: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 2023-03-17 10:25:25.513419: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513475: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513777: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513307: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.513427: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532179: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.513782: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513518: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.513443: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513504: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513743: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513322: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.513461: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:25.513818: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513556: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.532197: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 2023-03-17 10:25:25.513455: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513539: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513843: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513287: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.513483: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:25.513857: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:25.532240: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513544: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.532223: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 2023-03-17 10:25:25.513459: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513581: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513819: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513322: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.513513: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532202: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.513881: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.513561: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.513488: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513573: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513862: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.513329: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 5: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.513529: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532225: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.513887: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:25.532245: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.532559: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532232: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 2023-03-17 10:25:25.532241: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 2023-03-17 10:25:25.532260: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513525: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513879: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.532385: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532590: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.513542: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 1: 2023-03-17 10:25:25.513488: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 2023-03-17 10:25:25.513874: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:25.532417: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.532455: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.532470: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532610: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532625: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532639: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532642: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.532246: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 1: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:25.532324: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513558: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513870: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 6: 2023-03-17 10:25:25.532519: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 5: 2023-03-17 10:25:25.532665: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.513554: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 2023-03-17 10:25:25.513888: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:25.532727: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:25.513587: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 3: 2023-03-17 10:25:25.513888: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 7: 2023-03-17 10:25:25.513591: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_46200 0: 2023-03-17 10:25:25.532793: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:25.532266: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:25.532269: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 0125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:25.532271: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.532287: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.532284: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.532302: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.532830: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.532851: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532580: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:25.532270: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:25.532283: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:25.532285: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:25.532289: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 7: 2023-03-17 10:25:25.532312: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.532858: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.532860: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532618: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.532603: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.532613: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.532871: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532633: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532675: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 6: 2023-03-17 10:25:25.532630: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532688: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532699: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532707: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2: 2023-03-17 10:25:25.532889: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.533009: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 0: 2023-03-17 10:25:25.533025: I tensorflow/compiler/xla/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 3: 2023-03-17 10:25:56.065401: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065431: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065453: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065466: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065498: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065507: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065521: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.065728: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 3: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068312: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068328: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068325: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068323: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068323: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068324: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068327: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068330: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068329: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 3: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 3: 2023-03-17 10:25:56.068342: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068348: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068347: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068351: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068351: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068352: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 3: 2023-03-17 10:25:56.068354: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.073822: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073848: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073872: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073883: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073892: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073906: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.073995: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.074001: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 4: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.074978: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075008: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075028: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075038: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075049: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075053: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075068: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.075268: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 0: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075643: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075687: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075695: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075719: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075732: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075735: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075739: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.075816: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 2: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076166: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076168: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076168: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076171: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076170: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076176: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076182: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076182: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076177: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076191: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076192: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076195: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076195: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076198: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 4: 2023-03-17 10:25:56.076238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 4: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 4: 2023-03-17 10:25:56.076254: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.076772: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076803: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076850: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076861: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076872: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076875: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.076959: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 5: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077798: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077837: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077850: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077878: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: 2023-03-17 10:25:56.077956: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077887: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.077894: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.077974: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: 2023-03-17 10:25:56.077889: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.077996: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 6: 2023-03-17 10:25:56.077911: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.078004: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.078017: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.078019: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.078035: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.078038: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 7: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078201: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078203: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078205: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078207: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078205: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078220: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078220: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078222: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078208: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078223: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078230: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078236: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078272: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078293: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 2: 2023-03-17 10:25:56.078298: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 2: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 2: 2023-03-17 10:25:56.078312: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079845: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079847: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079848: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079849: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079851: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079856: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079851: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079855: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 5: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 5: 2023-03-17 10:25:56.079865: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079867: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079868: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079869: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079871: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079871: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079873: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 5: 2023-03-17 10:25:56.079874: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 7: 2023-03-17 10:25:56.080151: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080153: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080214: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: 2023-03-17 10:25:56.080156: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080158: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080166: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080160: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080215: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: 2023-03-17 10:25:56.080160: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080218: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: 2023-03-17 10:25:56.080168: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 7: 2023-03-17 10:25:56.080175: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 7: 2023-03-17 10:25:56.080178: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080182: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 7: 2023-03-17 10:25:56.080183: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080220: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: 2023-03-17 10:25:56.080223: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080239: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080223: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 7: 2023-03-17 10:25:56.080243: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 7: 2023-03-17 10:25:56.080259: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080226: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080234: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080234: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080235: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080238: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080231: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080239: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080235: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 6: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 6: 2023-03-17 10:25:56.080252: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080257: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 6: 2023-03-17 10:25:56.080260: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.091327: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091363: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091376: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091413: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091421: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091431: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091457: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.091490: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/project_462000125 1: /samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093495: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093501: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093502: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093511: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093507: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093508: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093506: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093524: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093525: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093529: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093532: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093532: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093586: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093590: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 1: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 1: 2023-03-17 10:25:56.093606: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 1: 2023-03-17 10:25:56.093610: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077519: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077521: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077523: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077524: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077526: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077527: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077536: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077537: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077536: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077535: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077545: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077549: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /pfs/lustrep2/projappl/project_462000125/samantao-public/apps/aws-ofi-rccl:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/rccl/rccl-develop-release/rccl/lib:/pfs/lustrep4/projappl/project_462000075/samantao-public/rocm/glibc/selected:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hip/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/hsa/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/rocm/rocm-5.2.3/llvm:/pfs/lustrep2/projappl/pro 0: ject_462000125/samantao-public/apps/suse-repo-deps/lib64:/pfs/lustrep2/projappl/project_462000125/samantao-public/apps/suse-repo-deps/usr/lib64:/opt/cray/pe/python/3.9.12.1/lib:/opt/cray/pe/gcc-libs:/opt/cray/libfabric/1.15.0.0/lib64 0: 2023-03-17 10:25:56.077556: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: 2023-03-17 10:25:56.077565: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly. 0: Successfully preprocessed all matching files. 0: Detected CUDA files, patching ldflags 0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... 0: Building extension module scaled_upper_triang_masked_softmax_cuda... 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 0: Loading extension module scaled_upper_triang_masked_softmax_cuda... 0: Successfully preprocessed all matching files. 0: Detected CUDA files, patching ldflags 0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... 0: Building extension module scaled_masked_softmax_cuda... 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 0: Loading extension module scaled_masked_softmax_cuda... 0: Successfully preprocessed all matching files. 0: Detected CUDA files, patching ldflags 0: Emitting ninja build file /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/fused_kernels/build/build.ninja... 0: Building extension module fused_mix_prec_layer_norm_cuda... 0: Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) 0: Loading extension module fused_mix_prec_layer_norm_cuda... 0: Successfully preprocessed all matching files. 0: Successfully preprocessed all matching files. 0: Successfully preprocessed all matching files. 0: Successfully preprocessed all matching files. 0: Successfully preprocessed all matching files. 0: Successfully preprocessed all matching files. 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 1: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 1: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 6: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 6: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 4: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 4: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 2: warnings.warn( 5: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 5: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 0: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 0: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 3: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 3: warnings.warn( 7: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead 7: warnings.warn( 7: [E ProcessGroupNCCL.cpp:821] [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805943 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805950 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:821] [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805964 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805958 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806000 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: [E ProcessGroupNCCL.cpp:821] [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 2: [E ProcessGroupNCCL.cpp:821] [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805997 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: [E ProcessGroupNCCL.cpp:821] [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 4: [E ProcessGroupNCCL.cpp:821] [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805982 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805989 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:821] [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: [E ProcessGroupNCCL.cpp:821] [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: [E ProcessGroupNCCL.cpp:821] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 7: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: Traceback (most recent call last): 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: Traceback (most recent call last): 3: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: Traceback (most recent call last): 3: Traceback (most recent call last): 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: Traceback (most recent call last): 7: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: Traceback (most recent call last): 4: Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: Traceback (most recent call last): 0: main() 3: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: Traceback (most recent call last): 3: main() 7: Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: main() 2: Traceback (most recent call last): 2: Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: Traceback (most recent call last): 2: Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 2: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: Traceback (most recent call last): 5: Traceback (most recent call last): 5: Traceback (most recent call last): 4: Traceback (most recent call last): 4: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: main() 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: Traceback (most recent call last): 7: main() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: main() 3: main() 5: Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: main() 5: Traceback (most recent call last): 4: Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: return f(*args, **kwargs) 5: Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: return f(*args, **kwargs) 2: main() 2: Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: Traceback (most recent call last): 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: return f(*args, **kwargs) 5: main() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: return f(*args, **kwargs) 0: Traceback (most recent call last): 2: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: main() 6: main() 5: main()main()main() 5: 5: 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: Traceback (most recent call last): 0: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: main()main() 2: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: main() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: main() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 0: return f(*args, **kwargs) 2: main() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 6: Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: main() 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: Traceback (most recent call last): 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: Traceback (most recent call last): 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: main() 4: main() 7: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: main() pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,return f(*args, **kwargs) 3: 4: main() 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: return f(*args, **kwargs) return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: main() 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: return f(*args, **kwargs) 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: main() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: return f(*args, **kwargs) 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: return f(*args, **kwargs) 5: return f(*args, **kwargs)return f(*args, **kwargs)return f(*args, **kwargs) 5: 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: main() 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: Traceback (most recent call last): 1: Traceback (most recent call last): 1: Traceback (most recent call last): 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: return f(*args, **kwargs) 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: return f(*args, **kwargs) 4: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: Traceback (most recent call last): 1: Traceback (most recent call last): 0: return f(*args, **kwargs) 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: Traceback (most recent call last): 6: Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: return f(*args, **kwargs) 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: Traceback (most recent call last): 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: main() 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: Traceback (most recent call last): 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: return f(*args, **kwargs) 1: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: return f(*args, **kwargs) 1: main()main() 1: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: return f(*args, **kwargs) 4: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: main() 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: main()Traceback (most recent call last): 1: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: model, optimizer, _, lr_scheduler = deepspeed.initialize(model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: main() 6: main() 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: return f(*args, **kwargs) main() 1: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: main() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: main() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: return f(*args, **kwargs)return f(*args, **kwargs) 1: 0: engine = PipelineEngine(args=args, 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: main() 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: engine = PipelineEngine(args=args, 2: engine = PipelineEngine(args=args, 2: engine = PipelineEngine(args=args, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: engine = PipelineEngine(args=args, 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: engine = PipelineEngine(args=args,engine = PipelineEngine(args=args, 0: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: engine = PipelineEngine(args=args, 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: super().__init__(*super_args, **super_kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: main() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: return f(*args, **kwargs) 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: return f(*args, **kwargs) 4: engine = PipelineEngine(args=args, 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: super().__init__(*super_args, **super_kwargs)super().__init__(*super_args, **super_kwargs) 0: 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: return f(*args, **kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: engine = PipelineEngine(args=args,engine = PipelineEngine(args=args, 7: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: super().__init__(*super_args, **super_kwargs) 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: engine = PipelineEngine(args=args, 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: super().__init__(*super_args, **super_kwargs) 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: super().__init__(*super_args, **super_kwargs) 7: engine = PipelineEngine(args=args, engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args,model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: super().__init__(*super_args, **super_kwargs) 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: engine = PipelineEngine(args=args, 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: 7: super().__init__(*super_args, **super_kwargs) 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: super().__init__(*super_args, **super_kwargs) 7: engine = PipelineEngine(args=args, 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: super().__init__(*super_args, **super_kwargs) 0: self._configure_distributed_model(model) 2: engine = PipelineEngine(args=args, 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: super().__init__(*super_args, **super_kwargs)super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: engine = PipelineEngine(args=args, 7: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._configure_distributed_model(model) 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: super().__init__(*super_args, **super_kwargs) 7: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: self._configure_distributed_model(model) 2: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: engine = PipelineEngine(args=args, 7: super().__init__(*super_args, **super_kwargs)self._configure_distributed_model(model) 7: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: super().__init__(*super_args, **super_kwargs) 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: self._configure_distributed_model(model) 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: 0: self._configure_distributed_model(model) 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: self._broadcast_model() 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: engine = PipelineEngine(args=args, 4: super().__init__(*super_args, **super_kwargs)super().__init__(*super_args, **super_kwargs) 4: 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: self._broadcast_model() 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: self._broadcast_model()self._broadcast_model() 0: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: model, optimizer, _, lr_scheduler = deepspeed.initialize(model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._configure_distributed_model(model) 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: self._configure_distributed_model(model) 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: self._configure_distributed_model(model)self._configure_distributed_model(model) 2: 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: model, optimizer, _, lr_scheduler = deepspeed.initialize(model, optimizer, _, lr_scheduler = deepspeed.initialize(model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: engine = PipelineEngine(args=args,engine = PipelineEngine(args=args,engine = PipelineEngine(args=args, 6: 6: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: dist.broadcast(p, 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: 3: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: super().__init__(*super_args, **super_kwargs) 4: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: dist.broadcast(p, 1: engine = PipelineEngine(args=args, 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: dist.broadcast(p, 2: return func(*args, **kwargs) 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: self._broadcast_model() 1: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: engine = PipelineEngine(args=args, 0: dist.broadcast(p, 0: dist.broadcast(p, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 4: self._broadcast_model() 7: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: engine = PipelineEngine(args=args,engine = PipelineEngine(args=args, 1: 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: self._broadcast_model()self._broadcast_model() 2: 3: self._configure_distributed_model(model)engine = PipelineEngine(args=args, 3: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: return func(*args, **kwargs) 0: return func(*args, **kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: engine = PipelineEngine(args=args, 4: self._broadcast_model() 7: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: engine = PipelineEngine(args=args, 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._broadcast_model() 7: dist.broadcast(p, 1: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: self._broadcast_model() 3: engine = PipelineEngine(args=args,engine = PipelineEngine(args=args, 3: 6: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: return func(*args, **kwargs) 0: dist.broadcast(p, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._broadcast_model() 7: dist.broadcast(p, 1: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: engine = PipelineEngine(args=args, 2: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: 3: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: self._configure_distributed_model(model) 4: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: return func(*args, **kwargs)return func(*args, **kwargs) 7: 3: super().__init__(*super_args, **super_kwargs) 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model)self._configure_distributed_model(model) 6: 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: dist.broadcast(p, 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: super().__init__(*super_args, **super_kwargs) 6: self._configure_distributed_model(model)self._configure_distributed_model(model) 6: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: dist.broadcast(p, 7: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: self._configure_distributed_model(model)engine = PipelineEngine(args=args, 3: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: super().__init__(*super_args, **super_kwargs) 6: self._configure_distributed_model(model) 4: dist.broadcast(p, 7: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: self._configure_distributed_model(model) 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: return func(*args, **kwargs) 1: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return func(*args, **kwargs) 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: dist.broadcast(p, 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: dist.broadcast(p, 3: self._configure_distributed_model(model)self._configure_distributed_model(model) 3: 4: return func(*args, **kwargs)dist.broadcast(p, 4: 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 0: return func(*args, **kwargs) 2: dist.broadcast(p, 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: self._broadcast_model() 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: dist.broadcast(p, 3: self._configure_distributed_model(model) 6: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: 4: dist.broadcast(p, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: self._broadcast_model() 6: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: dist.broadcast(p, 6: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor,return torch.distributed.broadcast(tensor=tensor, 0: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: self._broadcast_model() 6: self._broadcast_model() 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: self._broadcast_model() 4: return torch.distributed.broadcast(tensor=tensor, 7: return torch.distributed.broadcast(tensor=tensor, 1: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: return func(*args, **kwargs) 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 7: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: 6: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: super().__init__(*super_args, **super_kwargs) 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: dist.broadcast(p, 4: return torch.distributed.broadcast(tensor=tensor, 7: return torch.distributed.broadcast(tensor=tensor,return torch.distributed.broadcast(tensor=tensor, 7: 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: super().__init__(*super_args, **super_kwargs) 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: dist.broadcast(p,self._broadcast_model() 3: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)work = group.broadcast([tensor], opts)return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: 4: 7: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: dist.broadcast(p, 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: RuntimeError: NCCL communicator was aborted on rank 32. Original reason for failure was: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: dist.broadcast(p, 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: return torch.distributed.broadcast(tensor=tensor, 1: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step,pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: 3: return func(*args, **kwargs)dist.broadcast(p, 3: 6: dist.broadcast(p, 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: return torch.distributed.broadcast(tensor=tensor,return torch.distributed.broadcast(tensor=tensor, 4: 7: work = group.broadcast([tensor], opts)work = group.broadcast([tensor], opts) 7: 2: work = group.broadcast([tensor], opts) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: RuntimeErrorRuntimeError: : NCCL communicator was aborted on rank 59. Original reason for failure was: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out.NCCL communicator was aborted on rank 56. Original reason for failure was: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805950 milliseconds before timing out. 7: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: return torch.distributed.broadcast(tensor=tensor, 7: work = group.broadcast([tensor], opts) 1: super().__init__(*super_args, **super_kwargs) 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: RuntimeError: NCCL communicator was aborted on rank 63. Original reason for failure was: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: RuntimeError : NCCL communicator was aborted on rank 22. Original reason for failure was: [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out.return torch.distributed.broadcast(tensor=tensor, 2: 3: dist.broadcast(p, 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: work = group.broadcast([tensor], opts)work = group.broadcast([tensor], opts)work = group.broadcast([tensor], opts) 7: 7: 1: self._configure_distributed_model(model) 0: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: work = group.broadcast([tensor], opts) 7: RuntimeErrorRuntimeErrorRuntimeError: : : NCCL communicator was aborted on rank 62. Original reason for failure was: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out.NCCL communicator was aborted on rank 58. Original reason for failure was: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out.NCCL communicator was aborted on rank 57. Original reason for failure was: [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: 7: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: RuntimeError: NCCL communicator was aborted on rank 33. Original reason for failure was: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 7: work = group.broadcast([tensor], opts) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: self._configure_distributed_model(model) 0: return torch.distributed.broadcast(tensor=tensor, return torch.distributed.broadcast(tensor=tensor, 0: return torch.distributed.broadcast(tensor=tensor, 0: 2: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 60. Original reason for failure was: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805943 milliseconds before timing out. 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: RuntimeError: NCCL communicator was aborted on rank 34. Original reason for failure was: [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 1: self._configure_distributed_model(model) 0: return torch.distributed.broadcast(tensor=tensor, 2: return torch.distributed.broadcast(tensor=tensor,return torch.distributed.broadcast(tensor=tensor, 2: 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: work = group.broadcast([tensor], opts) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: RuntimeError: NCCL communicator was aborted on rank 36. Original reason for failure was: [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 1: self._configure_distributed_model(model) 0: work = group.broadcast([tensor], opts) 2: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: work = group.broadcast([tensor], opts) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: return func(*args, **kwargs)return func(*args, **kwargs) 3: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: RuntimeErrorwork = group.broadcast([tensor], opts): 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: NCCL communicator was aborted on rank 39. Original reason for failure was: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: RuntimeError: NCCL communicator was aborted on rank 35. Original reason for failure was: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: self._configure_distributed_model(model)self._configure_distributed_model(model) 1: 3: return func(*args, **kwargs) 6: return func(*args, **kwargs) 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: work = group.broadcast([tensor], opts) 1: self._configure_distributed_model(model) 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: RuntimeError: NCCL communicator was aborted on rank 38. Original reason for failure was: [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806000 milliseconds before timing out. 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: work = group.broadcast([tensor], opts) 3: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return func(*args, **kwargs) 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: work = group.broadcast([tensor], opts) 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: self._broadcast_model() 0: RuntimeError: NCCL communicator was aborted on rank 6. Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805997 milliseconds before timing out. 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: work = group.broadcast([tensor], opts)work = group.broadcast([tensor], opts) 0: 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: self._broadcast_model() 1: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: RuntimeError: RuntimeErrorNCCL communicator was aborted on rank 4. Original reason for failure was: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: 1: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: 0: : NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: work = group.broadcast([tensor], opts) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: engine = PipelineEngine(args=args,model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: RuntimeError: NCCL communicator was aborted on rank 7. Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: engine = PipelineEngine(args=args, 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: work = group.broadcast([tensor], opts) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return func(*args, **kwargs) 5: engine = PipelineEngine(args=args, 2: work = group.broadcast([tensor], opts) 3: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: engine = PipelineEngine(args=args, 1: return func(*args, **kwargs)self._broadcast_model() 1: 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: RuntimeError: NCCL communicator was aborted on rank 20. Original reason for failure was: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 3: work = group.broadcast([tensor], opts) 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: RuntimeErrorreturn torch.distributed.broadcast(tensor=tensor,: 3: NCCL communicator was aborted on rank 25. Original reason for failure was: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return func(*args, **kwargs) 6: return func(*args, **kwargs) 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return func(*args, **kwargs) 5: super().__init__(*super_args, **super_kwargs) 2: work = group.broadcast([tensor], opts) 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: super().__init__(*super_args, **super_kwargs) 3: return torch.distributed.broadcast(tensor=tensor, 5: super().__init__(*super_args, **super_kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: RuntimeError: NCCL communicator was aborted on rank 23. Original reason for failure was: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: super().__init__(*super_args, **super_kwargs)super().__init__(*super_args, **super_kwargs) 5: 1: dist.broadcast(p, 3: work = group.broadcast([tensor], opts) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: RuntimeError: NCCL communicator was aborted on rank 31. Original reason for failure was: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 6: return func(*args, **kwargs) 5: super().__init__(*super_args, **super_kwargs) 1: dist.broadcast(p, 3: work = group.broadcast([tensor], opts) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: RuntimeError: NCCL communicator was aborted on rank 28. Original reason for failure was: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: 5: self._configure_distributed_model(model) 1: dist.broadcast(p, 3: work = group.broadcast([tensor], opts) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: work = group.broadcast([tensor], opts) 3: RuntimeErrorwork = group.broadcast([tensor], opts): 5: self._configure_distributed_model(model) 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: NCCL communicator was aborted on rank 29. Original reason for failure was: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: RuntimeError: NCCL communicator was aborted on rank 26. Original reason for failure was: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805982 milliseconds before timing out. 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: self._configure_distributed_model(model) 2: RuntimeError: NCCL communicator was aborted on rank 18. Original reason for failure was: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 3: work = group.broadcast([tensor], opts) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: RuntimeError : work = group.broadcast([tensor], opts)NCCL communicator was aborted on rank 30. Original reason for failure was: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 24. Original reason for failure was: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: dist.broadcast(p, 3: work = group.broadcast([tensor], opts) 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: RuntimeError: NCCL communicator was aborted on rank 16. Original reason for failure was: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out. 3: RuntimeError: NCCL communicator was aborted on rank 27. Original reason for failure was: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: self._broadcast_model() 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: work = group.broadcast([tensor], opts) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: return torch.distributed.broadcast(tensor=tensor, 5: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: self._broadcast_model() File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: self._broadcast_model() 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: RuntimeError: NCCL communicator was aborted on rank 17. Original reason for failure was: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: return func(*args, **kwargs)return func(*args, **kwargs) 1: 5: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: return func(*args, **kwargs) dist.broadcast(p, 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: self._broadcast_model() 1: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: work = group.broadcast([tensor], opts) 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: RuntimeError: NCCL communicator was aborted on rank 19. Original reason for failure was: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 5: dist.broadcast(p, 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: dist.broadcast(p, 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return func(*args, **kwargs) dist.broadcast(p, 5: dist.broadcast(p, 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return torch.distributed.broadcast(tensor=tensor,return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: dist.broadcast(p, 1: return torch.distributed.broadcast(tensor=tensor,return func(*args, **kwargs) 1: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: dist.broadcast(p, 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: dist.broadcast(p, 1: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return torch.distributed.broadcast(tensor=tensor, File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return func(*args, **kwargs)return func(*args, **kwargs) 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 5: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: return func(*args, **kwargs) File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 6: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 5: return func(*args, **kwargs) 1: RuntimeError: NCCL communicator was aborted on rank 8. Original reason for failure was: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805989 milliseconds before timing out. 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: work = group.broadcast([tensor], opts) 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: RuntimeError: NCCL communicator was aborted on rank 10. Original reason for failure was: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: work = group.broadcast([tensor], opts) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: RuntimeError : work = group.broadcast([tensor], opts)NCCL communicator was aborted on rank 11. Original reason for failure was: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 5: return torch.distributed.broadcast(tensor=tensor, 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts)work = group.broadcast([tensor], opts) 1: RuntimeError 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: : NCCL communicator was aborted on rank 13. Original reason for failure was: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: RuntimeErrorRuntimeError: : NCCL communicator was aborted on rank 12. Original reason for failure was: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out.NCCL communicator was aborted on rank 15. Original reason for failure was: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 5: return torch.distributed.broadcast(tensor=tensor, 1: work = group.broadcast([tensor], opts) 6: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: RuntimeError: NCCL communicator was aborted on rank 9. Original reason for failure was: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 5: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 1: work = group.broadcast([tensor], opts) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: RuntimeError: NCCL communicator was aborted on rank 14. Original reason for failure was: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 6: return torch.distributed.broadcast(tensor=tensor, 5: work = group.broadcast([tensor], opts) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: RuntimeError: NCCL communicator was aborted on rank 41. Original reason for failure was: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 5: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 50. Original reason for failure was: [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: RuntimeError: NCCL communicator was aborted on rank 47. Original reason for failure was: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 54. Original reason for failure was: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 5: work = group.broadcast([tensor], opts) 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 40. Original reason for failure was: [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: RuntimeError: NCCL communicator was aborted on rank 46. Original reason for failure was: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 42. Original reason for failure was: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out.work = group.broadcast([tensor], opts) 5: 5: RuntimeError: NCCL communicator was aborted on rank 43. Original reason for failure was: [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 6: RuntimeError: NCCL communicator was aborted on rank 48. Original reason for failure was: [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 53. Original reason for failure was: [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 6: work = group.broadcast([tensor], opts)RuntimeError 6: : NCCL communicator was aborted on rank 52. Original reason for failure was: [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805958 milliseconds before timing out. 5: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 55. Original reason for failure was: [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 44. Original reason for failure was: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 6: RuntimeError: NCCL communicator was aborted on rank 49. Original reason for failure was: [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 51. Original reason for failure was: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 0: Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805964 milliseconds before timing out. 4: Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 4: main() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 37. Original reason for failure was: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 5: main() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 45. Original reason for failure was: [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 7: terminate called after throwing an instance of 'std::runtime_error' 7: terminate called after throwing an instance of ' what(): std::runtime_error[Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out.' 7: 7: what(): Fatal Python error: [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out.Aborted 7: 7: 7: Thread 0x000014b57592b4c0 (most recent call first): 7: 7: Fatal Python error: Aborted 7: 7: Thread 0x0000145fc31964c0 (most recent call first): 7: 4: terminate called after throwing an instance of 'std::runtime_error' 4: what(): [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: Fatal Python error: Aborted 4: 4: Thread 0x000014f4a2c2e4c0 (most recent call first): 4: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805964 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x0000153171c624c0 (most recent call first): 0: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x0000153e2cc1b4c0 (most recent call first): 5: 2: terminate called after throwing an instance of 'std::runtime_error' 2: terminate called after throwing an instance of 'std::runtime_error' 2: what(): [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: what(): [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: Fatal Python error: Aborted 2: 2: Thread 0x00001476424cb4c0Fatal Python error: (most recent call first): 2: Aborted 2: 2: 2: Thread 0x0000150a034324c0 (most recent call first): 2: 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805982 milliseconds before timing out. 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: Fatal Python error: Aborted 3: 3: Thread 0x0000153fe29044c0 (most recent call first): 3: 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: Fatal Python error: Aborted 3: 3: Thread 0x000014b6f5b024c0 (most recent call first): 3: 3: Fatal Python error: Aborted 3: 3: Thread 0x000014980f92a4c0 (most recent call first): 3: 3: Fatal Python error: Aborted 3: 3: Thread 0x000014b2acfc74c0 (most recent call first): 3: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805943 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x00001488e68ea4c0 (most recent call first): 7: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805950 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x000015042fdda4c0 (most recent call first): 7: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x000014e55e5be4c0 (most recent call first): 7: 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: terminate called after throwing an instance of 'std::runtime_error' 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 4: what(): [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 4: terminate called after throwing an instance of 'Fatal Python error: std::runtime_errorAbortedterminate called after throwing an instance of '' 4: 4: 4: std::runtime_error' 4: Thread 0x000014d060af24c0 (most recent call first): 4: 3: Fatal Python error: Aborted 3: 4: what(): what(): [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out.terminate called after throwing an instance of '[Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 3: Thread 0x0000148c9c44b4c0 (most recent call first): 3: 4: std::runtime_error 4: ' 4: terminate called after throwing an instance of 'std::runtime_error' 4: what(): [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: Fatal Python error: Fatal Python error: AbortedAborted 4: 4: 4: 4: Thread 0x what(): Thread 0x00001508f3f0c4c0[Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. (most recent call first): 4: 00001483ab0c54c0 4: (most recent call first): 4: 4: 4: Fatal Python error: Abortedterminate called after throwing an instance of ' 4: 4: std::runtime_error' 4: Thread 0x000014763f1fc4c0 (most recent call first): 4: 4: what(): Fatal Python error: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out.Aborted 4: 4: 4: Thread 0x00001517012504c0 (most recent call first): 4: 4: Fatal Python error: Aborted 4: 4: Thread 0x000014c0dfc254c0 (most recent call first): 4: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x000014cf8120b4c0 (most recent call first): 0: 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: terminate called after throwing an instance of '[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 2: std::runtime_error[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: ' 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 2: what(): [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: terminate called after throwing an instance of 'std::runtime_error' 2: terminate called after throwing an instance of 'std::runtime_error' 2: Fatal Python error: Aborted 2: 2: what(): Thread 0x[Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out.0000152d924604c0 what(): 2: (most recent call first): 2: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: Fatal Python error: Aborted 2: 2: Fatal Python error: terminate called after throwing an instance of 'Thread 0xAbortedstd::runtime_error00001504c0ae74c0 2: 2: ' 2: (most recent call first): 2: 2: Thread 0x0000145da05aa4c0 (most recent call first): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py"terminate called after throwing an instance of ', line 1404 in std::runtime_errorbroadcast' 2: what(): 2: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. File " 2: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70 in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231 in broadcast 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line terminate called after throwing an instance of '126 what(): std::runtime_error in [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out.' 2: log_wrapper 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py"Fatal Python error: , line 1007Aborted in 2: 2: _broadcast_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092 in _configure_distributed_model 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py" what(): , line 287 in [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out.__init__ 2: 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py"Fatal Python error: , line 59Aborted in 2: 2: __init__ 2: File Thread 0x00001456785424c0 (most recent call first): 2: 2: Fatal Python error: Aborted 2: 2: Thread 0x0000145601c094c0 (most recent call first): 2: 2: "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137 in initialize 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424 in setup_model_and_optimizer 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141 in pretrain 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231 in main 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346 in wrapper 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235 in 2: Thread 0x000014f7d63664c0 (most recent call first): 2: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x0000145dabeac4c0 (most recent call first): 5: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x000014fd53b624c0 (most recent call first): 7: 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: terminate called after throwing an instance of 'std::runtime_error' 1: terminate called after throwing an instance of 'std::runtime_error' 1: what(): [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: what(): [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805989 milliseconds before timing out. 1: Fatal Python error: Aborted 1: 1: Fatal Python error: Aborted 1: 1: Thread 0x000014d44428b4c0 (most recent call first): 1: 1: Thread 0x000014f31ef3b4c0 (most recent call first): 1: 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 6: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 6: terminate called after throwing an instance of 'std::runtime_error' 6: what(): [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: terminate called after throwing an instance of 'terminate called after throwing an instance of 'std::runtime_error' 6: std::runtime_error' 6: what(): [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out.Fatal Python error: 6: Aborted what(): 6: 6: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: Thread 0x000014abdd3564c0 (most recent call first): 6: 6: terminate called after throwing an instance of 'std::runtime_error' 6: Fatal Python error: terminate called after throwing an instance of 'AbortedFatal Python error: 6: 6: std::runtime_errorAbortedterminate called after throwing an instance of ' 6: 6: ' 6: Thread 0xstd::runtime_error000014a14ba254c0Thread 0x' 6: (most recent call first): 6: 00001456f51904c0 6: what(): (most recent call first): 6: [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: 6: what(): terminate called after throwing an instance of '[Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 6: std::runtime_error' 6: what(): [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805958 milliseconds before timing out. 6: Fatal Python error: Aborted 6: 6: terminate called after throwing an instance of 'Thread 0xstd::runtime_error000014df375524c0 (most recent call first): 6: ' 6: 6: Fatal Python error: Aborted 6: 6: Fatal Python error: Aborted 6: 6: Thread 0xThread 0x00001537f073a4c0 (most recent call first): 6: 000014cde78454c0 6: what(): (most recent call first): 6: 6: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: Fatal Python error: Aborted 6: 6: Thread 0x00001494b5a2d4c0 (most recent call first): 6: 6: what(): [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: Fatal Python error: Aborted 6: 6: Thread 0x000014ff3524c4c0 (most recent call first): 6: 7: Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 235, in 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 61. Original reason for failure was: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: terminate called after throwing an instance of 'std::runtime_error' 1: terminate called after throwing an instance of 'std::runtime_error' 1: what(): [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: what(): [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: Fatal Python error: Aborted 1: 1: Thread 0x0000153b7f0784c0 (most recent call first): 1: 1: Fatal Python error: Aborted 1: 1: Thread 0x00001485f56e54c0 (most recent call first): 1: 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: terminate called after throwing an instance of 'std::runtime_error' 1: what(): [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: Fatal Python error: Aborted 1: 1: Thread 0x000014ed56dab4c0 (most recent call first): 1: 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: terminate called after throwing an instance of 'std::runtime_error' 1: what(): [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: Fatal Python error: Aborted 1: 1: Thread 0x00001492a35e04c0 (most recent call first): 1: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x0000145c01ea34c0 (most recent call first): 0: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x000014ef358c14c0 (most recent call first): 5: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x0000151ecc4674c0 (most recent call first): 5: 4: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 4: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 4: terminate called after throwing an instance of 'std::runtime_error' 4: what(): [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806000 milliseconds before timing out. 4: Fatal Python error: Aborted 4: 4: Thread 0x000015024c6774c0 (most recent call first): 4: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x00001526794e14c0 (most recent call first): 5: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x00001491e460a4c0 (most recent call first): 5: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x000014651dfae4c0 (most recent call first): 0: 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 1: terminate called after throwing an instance of 'std::runtime_error' 1: what(): [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: terminate called after throwing an instance of 'std::runtime_error' 1: Fatal Python error: Aborted 1: 1: Thread 0x000014fe16bc64c0 (most recent call first): 1: what(): 1: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 1: Fatal Python error: Aborted 1: 1: Thread 0x0000149bff33c4c0 (most recent call first): 1: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x000014c0b43904c0 (most recent call first): 0: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: terminate called after throwing an instance of 'Fatal Python error: std::runtime_errorAborted' 0: 0: 0: Thread 0x000014c995eaf4c0 (most recent call first): 0: 0: what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x000014cf151bf4c0 (most recent call first): 0: 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 5: terminate called after throwing an instance of 'std::runtime_error' 5: terminate called after throwing an instance of 'std::runtime_error' 5: what(): [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: what(): [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: Fatal Python error: Aborted 5: 5: Thread 0x00001478a86444c0 (most recent call first): 5: 5: Fatal Python error: Aborted 5: 5: Thread 0x000014b96def24c0 (most recent call first): 5: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x0000146d240d84c0 (most recent call first): 7: 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 3: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 3: terminate called after throwing an instance of 'std::runtime_error' 3: terminate called after throwing an instance of 'std::runtime_error' 3: what(): what(): [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out.[Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 3: terminate called after throwing an instance of 'std::runtime_error' 3: Fatal Python error: Aborted 3: 3: Fatal Python error: Aborted 3: 3: Thread 0x what(): 00001515b078b4c0 (most recent call first): 3: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: Thread 0x 3: 0000151c9b3a44c0 (most recent call first): 3: 3: Fatal Python error: Aborted 3: 3: Thread 0x000014b8f9da74c0 (most recent call first): 3: 7: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 7: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 7: terminate called after throwing an instance of 'std::runtime_error' 7: what(): [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: Fatal Python error: Aborted 7: 7: Thread 0x000014dd6c3ca4c0 (most recent call first): 7: 0: [E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 0: [E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down. 0: terminate called after throwing an instance of 'std::runtime_error' 0: what(): [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805997 milliseconds before timing out. 0: Fatal Python error: Aborted 0: 0: Thread 0x000014e69b7a84c0 (most recent call first): 0: 2: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 79242) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 7: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 78891) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 5: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 82244) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 4: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 83777) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 3: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 79524) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 1: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 130831) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 0: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 39792) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 6: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 80705) of binary: /pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/bin/python 3: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/0/error.json) 1: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/0/error.json) 3: Traceback (most recent call last): 3: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 1: Traceback (most recent call last): 1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 3: return _run_code(code, main_globals, None, 3: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 1: return _run_code(code, main_globals, None, 1: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 0: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/0/error.json) 0: Traceback (most recent call last): 1: exec(code, run_globals) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 3: exec(code, run_globals) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 0: return _run_code(code, main_globals, None, 0: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 0: exec(code, run_globals) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 3: main() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/0/error.json) 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 7: Traceback (most recent call last): 7: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 1: main() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return _run_code(code, main_globals, None, 7: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 0: main() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: exec(code, run_globals) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 3: run(args) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 1: run(args) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 3: elastic_launch( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 7: main() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: run(args) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 1: elastic_launch( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 0: elastic_launch( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 3: return launch_agent(self._config, self._entrypoint, list(args)) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 1: return launch_agent(self._config, self._entrypoint, list(args)) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 3: raise ChildFailedError( 7: run(args) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 0: return launch_agent(self._config, self._entrypoint, list(args)) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 3: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 3: ============================================================ 3: Megatron-DeepSpeed/pretrain_gpt.py FAILED 3: ------------------------------------------------------------ 3: Failures: 3: [1]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 25 (local_rank: 1) 3: exitcode : -6 (pid: 79525) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/1/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: elastic_launch( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 1: raise ChildFailedError( 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 25. Original reason for failure was: [Rank 25] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: 3: [2]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 26 (local_rank: 2) 3: exitcode : -6 (pid: 79526) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/2/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 1: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 1: ============================================================ 1: Megatron-DeepSpeed/pretrain_gpt.py FAILED 1: ------------------------------------------------------------ 1: Failures: 1: [1]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 9 (local_rank: 1) 1: exitcode : -6 (pid: 130832) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/1/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: raise ChildFailedError( 0: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 0: ============================================================ 0: Megatron-DeepSpeed/pretrain_gpt.py FAILED 0: ------------------------------------------------------------ 0: Failures: 0: [1]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 1 (local_rank: 1) 0: exitcode : -6 (pid: 39793) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/1/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 26. Original reason for failure was: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805982 milliseconds before timing out. 3: 3: [3]: 3: time : 2023-03-17_10:58:00 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 3: host : nid005361 3: rank : 27 (local_rank: 3) 3: exitcode : -6 (pid: 79527) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/3/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: return launch_agent(self._config, self._entrypoint, list(args)) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 9. Original reason for failure was: [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: 1: [2]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 10 (local_rank: 2) 1: exitcode : -6 (pid: 130833) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/2/error.json 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 1. Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: [2]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 2 (local_rank: 2) 0: exitcode : -6 (pid: 39794) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/2/error.json 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 27. Original reason for failure was: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 3: [4]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 28 (local_rank: 4) 3: exitcode : -6 (pid: 79528) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/4/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 7: raise ChildFailedError( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 10. Original reason for failure was: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 1: 1: [3]: 1: time : 2023-03-17_10:58:00 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: [3]: 0: time : 2023-03-17_10:58:00 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 7: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 7: ============================================================ 7: Megatron-DeepSpeed/pretrain_gpt.py FAILED 7: ------------------------------------------------------------ 7: Failures: 7: [1]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 7: rank : 57 (local_rank: 1) 7: exitcode : -6 (pid: 78892) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/1/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: host : nid005359 1: rank : 11 (local_rank: 3) 1: exitcode : -6 (pid: 130834) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/3/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: host : nid005358 0: rank : 3 (local_rank: 3) 0: exitcode : -6 (pid: 39795) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/3/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 28. Original reason for failure was: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 3: [5]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 29 (local_rank: 5) 3: exitcode : -6 (pid: 79529) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/5/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 57. Original reason for failure was: [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: 7: [2]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 7: rank : 58 (local_rank: 2) 7: exitcode : -6 (pid: 78893) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/2/error.json 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 11. Original reason for failure was: [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 1: [4]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 12 (local_rank: 4) 1: exitcode : -6 (pid: 130835) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/4/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: [4]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 4 (local_rank: 4) 0: exitcode : -6 (pid: 39796) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/4/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 3: return func(*args, **kwargs) 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 29. Original reason for failure was: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: 3: [6]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 3: rank : 30 (local_rank: 6) 3: exitcode : -6 (pid: 79530) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/6/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 4: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/0/error.json) 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 58. Original reason for failure was: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: 7: [3]: 7: time : 2023-03-17_10:58:00 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 12. Original reason for failure was: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 1: [5]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 13 (local_rank: 5) 1: exitcode : -6 (pid: 130836) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/5/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 4. Original reason for failure was: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: [5]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 5 (local_rank: 5) 0: exitcode : -6 (pid: 39797) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/5/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: host : nid005365 7: rank : 59 (local_rank: 3) 7: exitcode : -6 (pid: 78894) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/3/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: RuntimeError: NCCL communicator was aborted on rank 30. Original reason for failure was: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 3: [7]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 31 (local_rank: 7) 3: exitcode : -6 (pid: 79531) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/7/error.json 3: traceback : Traceback (most recent call last): 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 1: return func(*args, **kwargs) 0: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 4: Traceback (most recent call last): 4: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 13. Original reason for failure was: [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805973 milliseconds before timing out. 1: 1: [6]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 5. Original reason for failure was: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805964 milliseconds before timing out. 0: 0: [6]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return _run_code(code, main_globals, None, 4: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 59. Original reason for failure was: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: 7: [4]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 7: rank : 60 (local_rank: 4) 7: exitcode : -6 (pid: 78895) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/4/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 1: rank : 14 (local_rank: 6) 1: exitcode : -6 (pid: 130837) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/6/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: rank : 6 (local_rank: 6) 0: exitcode : -6 (pid: 39798) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/6/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 31. Original reason for failure was: [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805987 milliseconds before timing out. 3: 3: ------------------------------------------------------------ 3: Root Cause (first observed failure): 3: [0]: 3: time : 2023-03-17_10:58:00 3: host : nid005361 3: rank : 24 (local_rank: 0) 3: exitcode : -6 (pid: 79524) 4: exec(code, run_globals) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 3: error_file: /tmp/torchelastic_1_4jdb1b/none_fx92baoh/attempt_0/0/error.json 3: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 3: return f(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 3: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 3: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 3: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 3: engine = PipelineEngine(args=args, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 3: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 3: self._configure_distributed_model(model) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 3: self._broadcast_model() 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 60. Original reason for failure was: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805943 milliseconds before timing out. 7: 7: [5]: 7: time : 2023-03-17_10:58:01 7: host : nid005365 7: rank : 61 (local_rank: 5) 7: exitcode : -6 (pid: 78896) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/5/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 1: RuntimeError: NCCL communicator was aborted on rank 14. Original reason for failure was: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 1: [7]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 15 (local_rank: 7) 1: exitcode : -6 (pid: 130838) 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/7/error.json 1: traceback : Traceback (most recent call last): 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: RuntimeError: NCCL communicator was aborted on rank 6. Original reason for failure was: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805997 milliseconds before timing out. 0: 0: [7]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 7 (local_rank: 7) 0: exitcode : -6 (pid: 39799) 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/7/error.json 0: traceback : Traceback (most recent call last): 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 3: dist.broadcast(p, 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 3: return func(*args, **kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 3: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 3: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 3: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 3: work = group.broadcast([tensor], opts) 3: RuntimeError: NCCL communicator was aborted on rank 24. Original reason for failure was: [Rank 24] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805983 milliseconds before timing out. 3: 3: ============================================================ 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return func(*args, **kwargs) 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 15. Original reason for failure was: [Rank 15] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805988 milliseconds before timing out. 1: 1: ------------------------------------------------------------ 1: Root Cause (first observed failure): 1: [0]: 1: time : 2023-03-17_10:58:00 1: host : nid005359 1: rank : 8 (local_rank: 0) 1: exitcode : -6 (pid: 130831) 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 7. Original reason for failure was: [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: ------------------------------------------------------------ 0: Root Cause (first observed failure): 0: [0]: 0: time : 2023-03-17_10:58:00 0: host : nid005358 0: rank : 0 (local_rank: 0) 0: exitcode : -6 (pid: 39792) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 61. Original reason for failure was: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: 7: [6]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 1: error_file: /tmp/torchelastic_9id5al_8/none_xu0ou91c/attempt_0/0/error.json 1: traceback : Traceback (most recent call last): 0: error_file: /tmp/torchelastic_y1xliq25/none_3q568_3e/attempt_0/0/error.json 0: traceback : Traceback (most recent call last): 7: rank : 62 (local_rank: 6) 7: exitcode : -6 (pid: 78897) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/6/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 1: return f(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 1: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 1: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 1: model, optimizer, _, lr_scheduler = deepspeed.initialize( 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 0: return f(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 0: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 0: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 0: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 1: engine = PipelineEngine(args=args, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 1: super().__init__(*super_args, **super_kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 1: self._configure_distributed_model(model) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 1: self._broadcast_model() 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 0: engine = PipelineEngine(args=args, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 0: super().__init__(*super_args, **super_kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 0: self._configure_distributed_model(model) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 0: self._broadcast_model() 4: main() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 1: dist.broadcast(p, 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 1: return func(*args, **kwargs) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 1: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 1: return torch.distributed.broadcast(tensor=tensor, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 0: dist.broadcast(p, 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 0: return func(*args, **kwargs) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 0: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 0: return torch.distributed.broadcast(tensor=tensor, 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 1: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 1: work = group.broadcast([tensor], opts) 1: RuntimeError: NCCL communicator was aborted on rank 8. Original reason for failure was: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=42, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805989 milliseconds before timing out. 1: 1: ============================================================ 0: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 0: work = group.broadcast([tensor], opts) 0: RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=50, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805990 milliseconds before timing out. 0: 0: ============================================================ 7: RuntimeError: NCCL communicator was aborted on rank 62. Original reason for failure was: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805953 milliseconds before timing out. 7: 7: [7]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 7: rank : 63 (local_rank: 7) 7: exitcode : -6 (pid: 78898) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/7/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 63. Original reason for failure was: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805952 milliseconds before timing out. 7: 7: ------------------------------------------------------------ 7: Root Cause (first observed failure): 7: [0]: 7: time : 2023-03-17_10:58:00 7: host : nid005365 7: rank : 56 (local_rank: 0) 7: exitcode : -6 (pid: 78891) 7: error_file: /tmp/torchelastic_rchtnbjh/none_hhbsa1ko/attempt_0/0/error.json 7: traceback : Traceback (most recent call last): 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 7: return f(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 7: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 7: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 7: model, optimizer, _, lr_scheduler = deepspeed.initialize( 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 7: engine = PipelineEngine(args=args, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 7: super().__init__(*super_args, **super_kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 7: self._configure_distributed_model(model) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 7: self._broadcast_model() 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 7: dist.broadcast(p, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 7: return func(*args, **kwargs) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 7: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 7: return torch.distributed.broadcast(tensor=tensor, 7: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 7: work = group.broadcast([tensor], opts) 7: RuntimeError: NCCL communicator was aborted on rank 56. Original reason for failure was: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805950 milliseconds before timing out. 7: 7: ============================================================ 4: run(args) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 4: elastic_launch( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 4: return launch_agent(self._config, self._entrypoint, list(args)) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 4: raise ChildFailedError( 4: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 4: ============================================================ 4: Megatron-DeepSpeed/pretrain_gpt.py FAILED 4: ------------------------------------------------------------ 4: Failures: 4: [1]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 33 (local_rank: 1) 4: exitcode : -6 (pid: 83778) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/1/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 33. Original reason for failure was: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 4: 4: [2]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 34 (local_rank: 2) 4: exitcode : -6 (pid: 83779) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/2/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 34. Original reason for failure was: [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: 4: [3]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 35 (local_rank: 3) 4: exitcode : -6 (pid: 83780) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/3/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 35. Original reason for failure was: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: 4: [4]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 36 (local_rank: 4) 4: exitcode : -6 (pid: 83781) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/4/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 36. Original reason for failure was: [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 4: 4: [5]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 37 (local_rank: 5) 4: exitcode : -6 (pid: 83782) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/5/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 37. Original reason for failure was: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805975 milliseconds before timing out. 4: 4: [6]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 38 (local_rank: 6) 4: exitcode : -6 (pid: 83783) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/6/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 38. Original reason for failure was: [Rank 38] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1806000 milliseconds before timing out. 4: 4: [7]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 39 (local_rank: 7) 4: exitcode : -6 (pid: 83784) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/7/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 39. Original reason for failure was: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: 4: ------------------------------------------------------------ 4: Root Cause (first observed failure): 4: [0]: 4: time : 2023-03-17_10:58:00 4: host : nid005362 4: rank : 32 (local_rank: 0) 4: exitcode : -6 (pid: 83777) 4: error_file: /tmp/torchelastic_0mkj7gbo/none_tfo6glno/attempt_0/0/error.json 4: traceback : Traceback (most recent call last): 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 4: return f(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 4: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 4: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 4: model, optimizer, _, lr_scheduler = deepspeed.initialize( 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 4: engine = PipelineEngine(args=args, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 4: super().__init__(*super_args, **super_kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 4: self._configure_distributed_model(model) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 4: self._broadcast_model() 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 4: dist.broadcast(p, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 4: return func(*args, **kwargs) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 4: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 4: return torch.distributed.broadcast(tensor=tensor, 4: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 4: work = group.broadcast([tensor], opts) 4: RuntimeError: NCCL communicator was aborted on rank 32. Original reason for failure was: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805974 milliseconds before timing out. 4: 4: ============================================================ 6: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/0/error.json) 6: Traceback (most recent call last): 6: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 6: return _run_code(code, main_globals, None, 6: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 6: exec(code, run_globals) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 6: main() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 6: run(args) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 6: elastic_launch( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 6: return launch_agent(self._config, self._entrypoint, list(args)) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 6: raise ChildFailedError( 6: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 6: ============================================================ 6: Megatron-DeepSpeed/pretrain_gpt.py FAILED 6: ------------------------------------------------------------ 6: Failures: 6: [1]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 49 (local_rank: 1) 6: exitcode : -6 (pid: 80706) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/1/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 49. Original reason for failure was: [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: 6: [2]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 50 (local_rank: 2) 6: exitcode : -6 (pid: 80707) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/2/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 50. Original reason for failure was: [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 6: 6: [3]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 51 (local_rank: 3) 6: exitcode : -6 (pid: 80708) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/3/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 51. Original reason for failure was: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: 6: [4]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 52 (local_rank: 4) 6: exitcode : -6 (pid: 80709) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/4/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 52. Original reason for failure was: [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805958 milliseconds before timing out. 6: 6: [5]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 53 (local_rank: 5) 6: exitcode : -6 (pid: 80710) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/5/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 53. Original reason for failure was: [Rank 53] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: 6: [6]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 54 (local_rank: 6) 6: exitcode : -6 (pid: 80711) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/6/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/0/error.json) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 54. Original reason for failure was: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805971 milliseconds before timing out. 6: 6: [7]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 55 (local_rank: 7) 6: exitcode : -6 (pid: 80712) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/7/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: Traceback (most recent call last): 2: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 2: return _run_code(code, main_globals, None, 2: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 55. Original reason for failure was: [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 6: 6: ------------------------------------------------------------ 6: Root Cause (first observed failure): 6: [0]: 6: time : 2023-03-17_10:58:00 6: host : nid005364 6: rank : 48 (local_rank: 0) 6: exitcode : -6 (pid: 80705) 6: error_file: /tmp/torchelastic_re932cu8/none_2y4skvxw/attempt_0/0/error.json 6: traceback : Traceback (most recent call last): 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 6: return f(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 6: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 6: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 6: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: exec(code, run_globals) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 6: engine = PipelineEngine(args=args, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 6: super().__init__(*super_args, **super_kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 6: self._configure_distributed_model(model) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 6: self._broadcast_model() 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 6: dist.broadcast(p, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 6: return func(*args, **kwargs) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 6: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 6: return torch.distributed.broadcast(tensor=tensor, 6: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 6: work = group.broadcast([tensor], opts) 6: RuntimeError: NCCL communicator was aborted on rank 48. Original reason for failure was: [Rank 48] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805972 milliseconds before timing out. 6: 6: ============================================================ 2: main() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 2: run(args) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 2: elastic_launch( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 2: return launch_agent(self._config, self._entrypoint, list(args)) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 2: raise ChildFailedError( 2: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 2: ============================================================ 2: Megatron-DeepSpeed/pretrain_gpt.py FAILED 2: ------------------------------------------------------------ 2: Failures: 2: [1]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 17 (local_rank: 1) 2: exitcode : -6 (pid: 79243) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/1/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 17. Original reason for failure was: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805970 milliseconds before timing out. 2: 2: [2]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 18 (local_rank: 2) 2: exitcode : -6 (pid: 79244) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/2/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 18. Original reason for failure was: [Rank 18] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: [3]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 19 (local_rank: 3) 2: exitcode : -6 (pid: 79245) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/3/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 19. Original reason for failure was: [Rank 19] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: [4]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 20 (local_rank: 4) 2: exitcode : -6 (pid: 79246) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/4/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 20. Original reason for failure was: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: [5]: 2: time : 2023-03-17_10:58:02 2: host : nid005360 2: rank : 21 (local_rank: 5) 2: exitcode : -6 (pid: 79247) 2: error_file: 2: traceback : Signal 6 (SIGABRT) received by PID 79247 2: [6]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 22 (local_rank: 6) 2: exitcode : -6 (pid: 79248) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/6/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 22. Original reason for failure was: [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: [7]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 23 (local_rank: 7) 2: exitcode : -6 (pid: 79249) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/7/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 23. Original reason for failure was: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805965 milliseconds before timing out. 2: 2: ------------------------------------------------------------ 2: Root Cause (first observed failure): 2: [0]: 2: time : 2023-03-17_10:58:00 2: host : nid005360 2: rank : 16 (local_rank: 0) 2: exitcode : -6 (pid: 79242) 2: error_file: /tmp/torchelastic_2hayt3oa/none_85zf9u3h/attempt_0/0/error.json 2: traceback : Traceback (most recent call last): 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 2: return f(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 2: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 2: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 2: model, optimizer, _, lr_scheduler = deepspeed.initialize( 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 2: engine = PipelineEngine(args=args, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 2: super().__init__(*super_args, **super_kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 2: self._configure_distributed_model(model) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 2: self._broadcast_model() 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 2: dist.broadcast(p, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 2: return func(*args, **kwargs) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 2: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 2: return torch.distributed.broadcast(tensor=tensor, 2: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 2: work = group.broadcast([tensor], opts) 2: RuntimeError: NCCL communicator was aborted on rank 16. Original reason for failure was: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805966 milliseconds before timing out. 2: 2: ============================================================ 5: ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:no error file defined for parent, to copy child error file (/tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/0/error.json) 5: Traceback (most recent call last): 5: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 197, in _run_module_as_main 5: return _run_code(code, main_globals, None, 5: File "/opt/cray/pe/python/3.9.12.1/lib/python3.9/runpy.py", line 87, in _run_code 5: exec(code, run_globals) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in 5: main() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main 5: run(args) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run 5: elastic_launch( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__ 5: return launch_agent(self._config, self._entrypoint, list(args)) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent 5: raise ChildFailedError( 5: torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 5: ============================================================ 5: Megatron-DeepSpeed/pretrain_gpt.py FAILED 5: ------------------------------------------------------------ 5: Failures: 5: [1]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 41 (local_rank: 1) 5: exitcode : -6 (pid: 82245) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/1/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 41. Original reason for failure was: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: 5: [2]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 42 (local_rank: 2) 5: exitcode : -6 (pid: 82246) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/2/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 42. Original reason for failure was: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: 5: [3]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 43 (local_rank: 3) 5: exitcode : -6 (pid: 82247) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/3/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 43. Original reason for failure was: [Rank 43] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: 5: [4]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 44 (local_rank: 4) 5: exitcode : -6 (pid: 82248) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/4/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 44. Original reason for failure was: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: 5: [5]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 45 (local_rank: 5) 5: exitcode : -6 (pid: 82249) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/5/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 45. Original reason for failure was: [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: 5: [6]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 46 (local_rank: 6) 5: exitcode : -6 (pid: 82250) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/6/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 46. Original reason for failure was: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: 5: [7]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 47 (local_rank: 7) 5: exitcode : -6 (pid: 82251) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/7/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 47. Original reason for failure was: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805960 milliseconds before timing out. 5: 5: ------------------------------------------------------------ 5: Root Cause (first observed failure): 5: [0]: 5: time : 2023-03-17_10:58:00 5: host : nid005363 5: rank : 40 (local_rank: 0) 5: exitcode : -6 (pid: 82244) 5: error_file: /tmp/torchelastic_cnbvbhpd/none_q_n5z7ff/attempt_0/0/error.json 5: traceback : Traceback (most recent call last): 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper 5: return f(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/pretrain_gpt.py", line 231, in main 5: pretrain(train_valid_test_datasets_provider, model_provider, forward_step, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 141, in pretrain 5: model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/Megatron-DeepSpeed/megatron/training.py", line 424, in setup_model_and_optimizer 5: model, optimizer, _, lr_scheduler = deepspeed.initialize( 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/__init__.py", line 137, in initialize 5: engine = PipelineEngine(args=args, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 59, in __init__ 5: super().__init__(*super_args, **super_kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 287, in __init__ 5: self._configure_distributed_model(model) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1092, in _configure_distributed_model 5: self._broadcast_model() 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1007, in _broadcast_model 5: dist.broadcast(p, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 126, in log_wrapper 5: return func(*args, **kwargs) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 231, in broadcast 5: return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op) 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 70, in broadcast 5: return torch.distributed.broadcast(tensor=tensor, 5: File "/pfs/lustrep4/scratch/project_462000119/muennighoff/nov-2022-bettercom/venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast 5: work = group.broadcast([tensor], opts) 5: RuntimeError: NCCL communicator was aborted on rank 40. Original reason for failure was: [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=41, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1805961 milliseconds before timing out. 5: 5: ============================================================ srun: error: nid005365: task 7: Exited with exit code 1 srun: launch/slurm: _step_signal: Terminating StepId=3328731.0 srun: error: nid005358: task 0: Exited with exit code 1 srun: error: nid005359: task 1: Exited with exit code 1 srun: error: nid005361: task 3: Exited with exit code 1 srun: error: nid005362: task 4: Exited with exit code 1 srun: error: nid005360: task 2: Exited with exit code 1 srun: error: nid005364: task 6: Exited with exit code 1 srun: error: nid005363: task 5: Exited with exit code 1